1. Problem Definition¶

Clearly defining the business problem or question to be solved. This ensures the project's objectives are aligned with organizational goals.

PROJECT 2
Exploratory analysis and predictive modeling of housing prices in Barcelona using KNIME, AutoML and Power BI

Objective¶

Expand the analysis and predictive modeling of housing prices in Barcelona using advanced tools: KNIME for ETL and analysis, Power BI for interactive visualization, and AutoML tools as low-code/no-code machine learning platforms. The goal is to improve the accuracy of the predictive model and provide interactive visualizations that facilitate decision-making.

Problem Definition Consolidated Notes¶

  • Project for predictive modeling of housing prices in Barcelona
  • Project goal is to improve the accuracy of the predictive model and provide interactive visualizations
  • Data Science project will be developed following the Data Science Life Cycle (DSLC) framework

2. Data Collection¶

Gathering relevant data from various sources, such as databases, APIs, or external datasets, ensuring it supports the problem statement.

Data Description¶

  • price: The listing price of the property (EUR).
  • rooms: Number of rooms.
  • bathroom: Number of bathrooms.
  • lift: Whether the building has an elevator (lift).
  • terrace: Whether the unit has a terrace.
  • square_meters: Surface area in square meters.
  • real_state: Type of property ("real_state" is the dataset's spelling of real estate).
  • neighborhood: Neighborhood of Barcelona.
  • square_meters_price: Price per square meter (EUR/m²).

Importing necessary libraries¶

In [4]:
import pandas as pd
import numpy as np

# To help with data visualization
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
%matplotlib inline
sns.set_style('whitegrid') # set style for visualization

# To suppress warnings
import warnings
warnings.filterwarnings('ignore')

from scipy.stats import zscore

#normalizing
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PolynomialFeatures # to scale the data

# modeling
import statsmodels.api as sm # adding a constant to the independent variables
from sklearn.model_selection import train_test_split # splitting data in train and test sets
from sklearn.preprocessing import PowerTransformer, StandardScaler # for normalization


from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.neural_network import MLPRegressor
import xgboost as xgb
import lightgbm as lgb
#import catboost as catb
# CatBoost is a fast, scalable, high-performance gradient-boosting-on-decision-trees
# library, used for ranking, classification, regression and other ML tasks.
# It could not be tested in this project due to setup issues.

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

#To check multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor

# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# for validation
from sklearn.model_selection import cross_val_score, KFold, cross_validate

# Deploy
import joblib
import streamlit as st

import datetime
import os

Loading the Dataset¶

In [5]:
df=pd.read_csv('DATA_Barcelona_Fotocasa_HousingPrices_Augmented.csv')
  • Dataset provided by the academy
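Since the whole notebook depends on a single local CSV, a defensive loader can fail early with a clearer message than pandas' default error. This is only a sketch; `load_listings` is a hypothetical helper, not part of the project:

```python
import os

import pandas as pd

# Path taken from the loading cell above
CSV_PATH = 'DATA_Barcelona_Fotocasa_HousingPrices_Augmented.csv'

def load_listings(path):
    """Load the listings CSV, failing early if the file is absent."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Expected dataset at {path!r}; check the working directory")
    return pd.read_csv(path)
```

With the CSV in place, `df = load_listings(CSV_PATH)` behaves exactly like the cell above.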

Web Scraping¶

  • The option of enriching the data through web scraping was explored.
  • Self-learning attempts are reflected in the Python notebooks "scraper_fotocasa.ipynb" and "scraper_v2.ipnb".
  • The programs aimed to browse the Fotocasa website and collect information following the format of the dataset provided by the academy.
  • The programs were never completed and no version is functional; development was stopped on academic recommendation.
  • Web scraping can raise legal and ethical considerations, especially if it involves accessing data without authorization or violating a website's terms of service.
  • The academy recommended requesting permission from the web portal before continuing with development.
  • The related links were read (https://www.fotocasa.es/es/politica-privacidad/p ; https://www.fotocasa.es/es/aviso-legal/cp ; https://www.fotocasa.es/es/aviso-legal/ln) and no explicit information was found regarding the authorization or prohibition of web scraping activities.
  • It is noted that the absence of explicit permission does not imply consent.
  • Permission was requested from the portal and a negative response was received:

  Hello Carlos,
  We are sorry that we cannot help you, since for privacy reasons we do not carry out this type of collaboration.
  Regards,
  ayuda@fotocasa.zendesk.com

Property types¶

  • As the problem aims to predict housing prices in Barcelona, brief complementary information about property types in Spain is included for reference.
    • Studio (Estudio): Typically the smallest type of dwelling, a studio is a single open space that combines the living area, bedroom, and kitchen, with a separate bathroom. These are ideal for individuals or couples seeking a compact living space.
    • Attic (Ático): An attic refers to a top-floor apartment, often featuring sloped ceilings and sometimes including a terrace. The size can vary, but attics are generally larger than studios and may offer unique architectural features.
    • Apartment (Apartamento): In Spain, the term "apartamento" usually denotes a modest-sized dwelling, typically with one or two bedrooms. These are suitable for small families or individuals desiring separate living and sleeping areas.
    • Flat (Piso): The term "piso" is commonly used to describe larger residential units, often with multiple bedrooms and ample living space. Flats are prevalent in urban areas and cater to families or individuals seeking more spacious accommodations.

Data Collection Consolidated Notes¶

  • The project will consider the data provided by the academy
  • Web scraping involves automatically extracting data from websites, which can be subject to legal restrictions depending on the website's policies and applicable laws.
  • As the problem aims to predict housing prices in Barcelona, brief complementary information about property types in Spain is included for reference.

3. Data Preparation¶

Cleaning, preprocessing, and organizing the data. This includes handling missing values, outliers, data transformations, and feature engineering.

Data Overview¶

In [6]:
df.head() # preview the first 5 rows
Out[6]:
Unnamed: 0 price rooms bathroom lift terrace square_meters real_state neighborhood square_meters_price
0 0 750 3.0 1.0 True False 60.0 flat Horta- Guinardo 12.500000
1 1 770 2.0 1.0 True False 59.0 flat Sant Andreu 13.050847
2 2 1300 1.0 1.0 True True 30.0 flat Gràcia 43.333333
3 3 2800 1.0 1.0 True True 70.0 flat Ciutat Vella 40.000000
4 4 720 2.0 1.0 True False 44.0 flat Sant Andreu 16.363636
In [7]:
df.tail() # preview the last 5 rows
Out[7]:
Unnamed: 0 price rooms bathroom lift terrace square_meters real_state neighborhood square_meters_price
16371 16371 950 1.982 0.957 True False 60.701 flat Sarria-Sant Gervasi 13.174
16372 16372 825 1.086 0.961 True False 47.224 flat Eixample 14.893
16373 16373 1200 4.195 1.957 True False 116.100 flat Les Corts 10.746
16374 16374 1100 2.899 2.155 False False 57.805 flat Sant Martí NaN
16375 16375 850 2.127 1.024 True False 58.503 flat Eixample 15.390
In [8]:
df.sample(20) # preview 20 random rows
Out[8]:
Unnamed: 0 price rooms bathroom lift terrace square_meters real_state neighborhood square_meters_price
12837 12837 2200 1.864 1.843 True False 95.872 flat Eixample 20.557000
11309 11309 1350 2.038 0.921 True True 56.194 attic Horta- Guinardo 20.583000
336 336 3400 5.000 4.000 True True 220.000 flat Sarria-Sant Gervasi 15.454545
10843 10843 660 2.019 1.098 False False 49.637 flat Eixample 12.951000
6477 6477 1800 3.000 2.000 True False 80.000 attic Sarria-Sant Gervasi 22.500000
7940 7940 1600 4.000 2.000 True False 111.000 flat Les Corts 14.414414
15192 15192 900 0.000 0.960 False False 45.971 study Eixample 19.102000
8822 8822 800 0.000 0.917 True False 29.380 NaN Ciutat Vella 26.013000
14010 14010 3868 1.941 1.835 False False 99.718 apartment Eixample 37.907000
8876 8876 1250 3.211 1.098 True False 78.160 flat Eixample 17.484000
2749 2749 1400 3.000 2.000 True True 82.000 flat Eixample 17.073171
9071 9071 695 1.926 0.987 False False 40.050 NaN Sarria-Sant Gervasi 17.034000
7021 7021 733 1.000 1.000 False False 55.000 flat Les Corts 13.327273
6836 6836 868 1.000 1.000 True False 41.000 flat Gràcia 21.170732
16169 16169 850 2.087 0.963 True False 67.412 flat Sarria-Sant Gervasi 14.360000
7334 7334 2072 1.000 1.000 False False 55.000 apartment Horta- Guinardo 37.672727
6196 6196 4173 3.000 2.000 True False 110.000 apartment Eixample 37.936364
8750 8750 800 NaN 0.910 True False 62.304 attic Sarria-Sant Gervasi 12.500000
13828 13828 1350 2.903 1.925 False False 110.805 flat Sarria-Sant Gervasi 12.520000
12728 12728 800 0.922 0.931 True True 48.394 flat Horta- Guinardo 16.281000
  • The variable 'Unnamed: 0' duplicates the index and should be dropped from the data
  • Target variable for modeling is "price"
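The two notes above can be sketched on a toy frame standing in for the real data (the real one has 16376 rows); the `X`/`y` names are illustrative:

```python
import pandas as pd

# Toy stand-in for the loaded dataset
df = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'price': [750, 770, 1300],
    'rooms': [3.0, 2.0, 1.0],
})

# 'Unnamed: 0' just duplicates the DataFrame index, so it is safe to drop
df = df.drop(columns=['Unnamed: 0'])

# Keep 'price' as the modeling target, everything else as candidate features
y = df['price']
X = df.drop(columns=['price'])
```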
In [9]:
print("There are", df.shape[0], 'rows and', df.shape[1], "columns.") # number of observations and features
There are 16376 rows and 10 columns.
  • There are 16376 rows and 10 columns.
  • The Project 1 dataset had 8188 rows and 10 columns; this dataset has exactly twice as many rows, consistent with the "Augmented" label in the filename.
In [10]:
df.dtypes # data types
Out[10]:
Unnamed: 0               int64
price                    int64
rooms                  float64
bathroom               float64
lift                      bool
terrace                   bool
square_meters          float64
real_state              object
neighborhood            object
square_meters_price    float64
dtype: object
  • Data types match the data description, except that 'rooms' and 'bathroom' are float where integers would be expected
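One possible treatment of the float counts (an assumption for illustration, not necessarily the project's final choice) is to round to the nearest whole count and cast to pandas' nullable integer dtype, so existing NaNs survive the conversion. The toy values below mirror the fractional counts in this dataset:

```python
import pandas as pd

# Toy values mirroring the fractional counts seen in the augmented data
df = pd.DataFrame({'rooms': [3.0, 1.982, 4.195, None],
                   'bathroom': [1.0, 0.957, 1.957, 2.0]})

# Round to the nearest whole count, then cast to nullable 'Int64'
# so the NaN in 'rooms' is preserved as pd.NA
for col in ['rooms', 'bathroom']:
    df[col] = df[col].round().astype('Int64')
```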
In [11]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16376 entries, 0 to 16375
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           16376 non-null  int64  
 1   price                16376 non-null  int64  
 2   rooms                15966 non-null  float64
 3   bathroom             15989 non-null  float64
 4   lift                 16376 non-null  bool   
 5   terrace              16376 non-null  bool   
 6   square_meters        15968 non-null  float64
 7   real_state           15458 non-null  object 
 8   neighborhood         16376 non-null  object 
 9   square_meters_price  15937 non-null  float64
dtypes: bool(2), float64(4), int64(2), object(2)
memory usage: 1.0+ MB
  • There are missing values (NaN) in multiple variables: 'rooms', 'bathroom', 'square_meters', 'real_state' and 'square_meters_price'
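A quick tally of those gaps, sketched on a toy frame with the same kind of NaNs that `df.info()` reports:

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps df.info() reports
df = pd.DataFrame({
    'rooms': [3.0, np.nan, 2.0],
    'real_state': ['flat', 'attic', None],
    'price': [750, 770, 1300],
})

# Count missing values per column, worst first
missing = df.isnull().sum().sort_values(ascending=False)
```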
In [12]:
df.describe(include="all").T # statistical summary of the data.
Out[12]:
count unique top freq mean std min 25% 50% 75% max
Unnamed: 0 16376.0 NaN NaN NaN 8187.5 4727.488339 0.0 4093.75 8187.5 12281.25 16375.0
price 16376.0 NaN NaN NaN 1437.04586 1106.831419 320.0 875.0 1100.0 1514.0 15000.0
rooms 15966.0 NaN NaN NaN 2.421662 1.13863 0.0 1.884 2.111 3.0 10.754
bathroom 15989.0 NaN NaN NaN 1.504682 0.723192 0.9 1.0 1.037 2.0 8.0
lift 16376 2 True 11246 NaN NaN NaN NaN NaN NaN NaN
terrace 16376 2 False 12770 NaN NaN NaN NaN NaN NaN NaN
square_meters 15968.0 NaN NaN NaN 84.368874 47.486402 10.0 56.0855 72.748 95.0 679.0
real_state 15458 4 flat 12650 NaN NaN NaN NaN NaN NaN NaN
neighborhood 16376 10 Eixample 4795 NaN NaN NaN NaN NaN NaN NaN
square_meters_price 15937.0 NaN NaN NaN 17.73171 9.199731 4.549 12.777778 15.31 19.402 197.272
  • Unit sizes range from 10 m² to 679 m², with a mean of 84.37 m²
  • Unit prices range from 320 EUR to 15000 EUR per month, with a mean of 1437 EUR per month
  • The price range is assumed to refer to monthly rent, so it is treated as EUR per month
  • Prices per square meter range from 4.549 to 197.272 EUR/m²/month, with a mean of 17.73 EUR/m²/month
  • There are units listed with zero rooms as well as fractional counts such as 10.754 rooms
  • There are units with 0.9 bathrooms
In [13]:
# Uniques
df.nunique() # number of unique values per column
Out[13]:
Unnamed: 0             16376
price                    889
rooms                   1995
bathroom                1015
lift                       2
terrace                    2
square_meters           7751
real_state                 4
neighborhood              10
square_meters_price     9122
dtype: int64
In [14]:
df.columns
Out[14]:
Index(['Unnamed: 0', 'price', 'rooms', 'bathroom', 'lift', 'terrace',
       'square_meters', 'real_state', 'neighborhood', 'square_meters_price'],
      dtype='object')
In [15]:
for i in ['rooms', 'bathroom', 'lift', 'terrace', 'real_state', 'neighborhood']: # Checking uniques
    print (i,": ",df[i].unique())
rooms :  [3.    2.    1.    ... 4.131 4.195 2.899]
bathroom :  [1.    2.    3.    ... 5.898 2.862 2.866]
lift :  [ True False]
terrace :  [False  True]
real_state :  ['flat' 'attic' nan 'apartment' 'study']
neighborhood :  ['Horta- Guinardo' 'Sant Andreu' 'Gràcia' 'Ciutat Vella'
 'Sarria-Sant Gervasi' 'Les Corts' 'Sant Martí' 'Eixample'
 'Sants-Montjuïc' 'Nou Barris']
In [16]:
# Uniques
cat_cols = df.select_dtypes(include=['category', 'object','bool']).columns.tolist()
for column in cat_cols:
    print(df[column].value_counts())
    print("-" * 50)
lift
True     11246
False     5130
Name: count, dtype: int64
--------------------------------------------------
terrace
False    12770
True      3606
Name: count, dtype: int64
--------------------------------------------------
real_state
flat         12650
apartment     1967
attic          633
study          208
Name: count, dtype: int64
--------------------------------------------------
neighborhood
Eixample               4795
Sarria-Sant Gervasi    2765
Ciutat Vella           2716
Gràcia                 1416
Sant Martí             1257
Sants-Montjuïc         1165
Les Corts              1045
Horta- Guinardo         638
Sant Andreu             368
Nou Barris              211
Name: count, dtype: int64
--------------------------------------------------
  • There are four types of properties, the most common being "flat"
  • Most units do not have a terrace
  • Most units do have a lift
  • The neighborhood with the largest unit count is "Eixample"
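The raw counts above may be easier to compare as shares; `value_counts(normalize=True)` does this directly. The sample below is a toy stand-in, the real proportions are in the output above:

```python
import pandas as pd

# Toy sample of two of the categorical columns
df = pd.DataFrame({
    'terrace': [False, False, False, True],
    'real_state': ['flat', 'flat', 'attic', 'flat'],
})

# Shares instead of raw counts; the most frequent category comes first
terrace_share = df['terrace'].value_counts(normalize=True)
real_state_share = df['real_state'].value_counts(normalize=True)
```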
In [17]:
# Display all rows in pandas outputs
pd.set_option("display.max_rows", None)  # Set to None to show all rows

# Print the value counts of 'rooms'
print(df['rooms'].value_counts())
print("-" * 50)
print(df['bathroom'].value_counts())

# Optionally reset display settings (if needed later in the script)
pd.reset_option("display.max_rows")
rooms
2.000     2608
3.000     2461
1.000     1600
4.000     1061
0.000      399
5.000      232
6.000       28
...
Name: count, dtype: int64
  • Beyond the whole-number counts, 'rooms' contains hundreds of fractional values (0.910, 1.982, 3.292, ...), each appearing fewer than 20 times; 'bathroom', with 1015 unique values, shows the same pattern
3.953        2
3.943        2
4.098        2
4.186        2
3.802        2
4.315        2
4.532        2
3.903        2
3.683        2
4.206        2
4.292        2
3.690        2
4.285        2
4.323        2
3.744        2
2.845        2
4.368        2
3.803        2
3.033        2
4.215        2
4.377        2
3.969        2
3.770        2
3.647        2
2.737        2
3.145        2
3.053        2
3.124        2
3.875        2
1.023        2
4.095        2
3.866        2
2.808        2
3.696        2
3.148        2
3.018        2
2.795        2
2.953        2
4.022        2
3.851        2
4.388        2
3.882        2
3.673        2
3.182        2
3.287        2
4.099        2
4.780        2
3.026        2
3.896        2
3.658        2
3.632        2
4.634        2
3.043        2
1.986        2
4.132        2
4.386        2
3.715        2
1.919        2
3.843        2
3.025        2
3.880        2
2.935        2
3.012        2
2.088        2
1.881        2
1.892        2
4.023        2
3.174        2
4.138        2
2.998        2
2.779        2
4.324        2
4.083        2
3.864        2
3.825        2
3.257        2
3.248        2
3.760        2
4.357        2
4.309        2
4.234        2
3.234        2
4.260        2
3.885        2
3.922        2
3.627        2
3.189        2
4.372        2
4.249        2
3.920        2
4.387        2
3.737        2
2.966        2
3.701        2
4.043        2
4.334        2
3.651        2
3.804        2
3.991        2
3.059        2
4.326        2
3.005        2
3.747        2
3.190        2
3.871        2
4.160        2
1.097        2
3.269        2
2.891        2
5.323        2
5.572        2
4.062        2
3.956        2
2.039        2
2.788        2
3.048        2
2.099        2
3.709        2
3.710        2
3.910        2
2.975        2
4.257        2
3.862        2
2.883        2
4.028        2
4.321        2
4.077        2
4.100        2
2.821        2
4.107        2
4.177        2
2.946        2
4.375        2
3.972        2
2.851        2
4.952        2
2.778        2
3.144        2
4.116        2
3.714        2
4.130        2
4.133        2
2.178        2
4.317        2
3.089        2
3.847        2
3.700        2
4.266        2
3.891        2
4.179        2
4.243        2
2.785        2
3.780        2
4.277        2
3.840        2
4.335        2
2.918        2
1.038        2
3.133        2
3.766        2
3.616        2
4.075        2
2.097        2
3.932        2
3.151        2
3.877        2
3.808        2
4.176        2
9.000        2
3.934        1
5.465        1
4.365        1
3.729        1
3.652        1
5.410        1
5.405        1
3.670        1
3.725        1
3.952        1
4.801        1
4.298        1
4.799        1
4.355        1
4.537        1
3.998        1
5.194        1
4.297        1
3.650        1
3.789        1
4.265        1
4.399        1
4.250        1
4.276        1
4.701        1
10.748       1
4.036        1
3.945        1
4.856        1
5.457        1
4.065        1
3.854        1
3.907        1
4.239        1
4.554        1
6.591        1
4.089        1
5.361        1
3.677        1
4.764        1
5.437        1
4.192        1
2.899        1
4.822        1
5.426        1
3.996        1
5.023        1
5.126        1
4.278        1
4.703        1
4.400        1
3.796        1
5.209        1
3.842        1
4.293        1
4.030        1
4.282        1
4.833        1
5.448        1
5.182        1
4.602        1
5.371        1
3.849        1
3.001        1
2.100        1
3.695        1
4.085        1
4.082        1
4.270        1
2.915        1
4.211        1
4.014        1
5.463        1
3.887        1
5.042        1
4.272        1
2.829        1
2.890        1
5.344        1
3.853        1
3.628        1
5.040        1
4.093        1
4.396        1
4.052        1
2.729        1
2.819        1
5.416        1
3.857        1
3.049        1
4.124        1
5.498        1
3.785        1
4.390        1
5.027        1
4.267        1
4.230        1
4.198        1
5.447        1
3.884        1
4.934        1
4.008        1
4.055        1
4.902        1
4.760        1
4.661        1
7.537        1
3.734        1
4.894        1
2.880        1
4.322        1
4.047        1
3.679        1
3.610        1
4.893        1
4.233        1
4.965        1
5.380        1
4.091        1
4.973        1
4.084        1
4.048        1
4.385        1
5.499        1
4.051        1
5.064        1
3.817        1
5.348        1
3.121        1
4.170        1
4.106        1
5.188        1
3.721        1
4.820        1
4.311        1
3.768        1
5.606        1
4.560        1
3.806        1
4.327        1
4.316        1
3.784        1
2.801        1
3.931        1
3.031        1
4.004        1
3.820        1
5.454        1
5.483        1
3.722        1
4.157        1
3.645        1
5.331        1
5.055        1
4.920        1
4.709        1
5.298        1
3.799        1
3.642        1
5.176        1
6.052        1
3.995        1
4.214        1
3.989        1
4.025        1
4.031        1
4.755        1
5.411        1
5.233        1
3.937        1
4.208        1
4.041        1
5.986        1
4.566        1
5.081        1
3.698        1
9.384        1
3.691        1
3.900        1
4.824        1
4.188        1
4.342        1
3.865        1
4.814        1
3.657        1
2.806        1
4.518        1
3.120        1
3.753        1
3.844        1
3.794        1
3.917        1
5.283        1
4.063        1
3.941        1
4.302        1
4.108        1
4.174        1
4.971        1
4.002        1
3.756        1
3.726        1
4.102        1
5.474        1
5.201        1
3.965        1
4.202        1
4.834        1
4.224        1
5.490        1
5.472        1
5.145        1
4.540        1
5.187        1
5.249        1
4.049        1
3.659        1
4.090        1
2.971        1
3.638        1
4.887        1
2.989        1
3.697        1
4.933        1
5.496        1
5.114        1
3.221        1
4.060        1
3.839        1
5.116        1
3.611        1
3.810        1
5.433        1
3.894        1
3.618        1
4.252        1
4.301        1
4.861        1
4.575        1
4.648        1
4.284        1
4.005        1
5.180        1
4.295        1
6.085        1
5.445        1
4.855        1
3.717        1
4.225        1
5.456        1
4.204        1
2.842        1
4.115        1
4.034        1
4.361        1
5.197        1
2.905        1
2.965        1
4.056        1
4.876        1
5.338        1
3.694        1
4.094        1
5.039        1
2.112        1
5.263        1
2.705        1
4.963        1
4.237        1
4.147        1
4.529        1
3.893        1
5.029        1
4.555        1
2.823        1
2.947        1
5.395        1
4.369        1
5.358        1
3.267        1
4.210        1
5.475        1
5.054        1
4.918        1
4.839        1
4.118        1
4.509        1
5.291        1
3.926        1
3.988        1
2.943        1
2.773        1
2.910        1
3.169        1
4.508        1
4.009        1
3.827        1
3.823        1
4.392        1
4.010        1
5.387        1
4.354        1
3.936        1
4.244        1
4.558        1
4.875        1
4.524        1
4.382        1
4.967        1
1.976        1
5.109        1
3.757        1
3.030        1
4.344        1
4.281        1
3.831        1
4.813        1
2.936        1
4.017        1
4.358        1
4.931        1
5.723        1
4.070        1
5.079        1
4.242        1
5.339        1
4.209        1
4.707        1
3.689        1
5.230        1
3.718        1
3.758        1
5.392        1
3.901        1
3.921        1
7.580        1
3.890        1
3.811        1
4.205        1
5.276        1
4.078        1
4.159        1
5.214        1
4.140        1
4.600        1
4.657        1
4.923        1
3.883        1
5.346        1
4.153        1
4.201        1
4.597        1
3.752        1
4.156        1
4.339        1
2.768        1
5.450        1
5.402        1
4.175        1
6.307        1
3.964        1
3.974        1
4.212        1
4.927        1
4.033        1
4.995        1
3.930        1
3.042        1
3.923        1
4.941        1
3.836        1
4.919        1
5.458        1
4.542        1
9.543        1
3.745        1
4.653        1
4.232        1
4.015        1
4.066        1
7.623        1
2.758        1
3.639        1
5.337        1
4.121        1
2.992        1
3.980        1
4.736        1
3.981        1
2.194        1
4.253        1
3.924        1
3.990        1
3.982        1
3.813        1
2.707        1
3.939        1
5.130        1
4.348        1
4.943        1
3.740        1
4.885        1
4.850        1
5.258        1
3.728        1
3.942        1
3.738        1
4.832        1
4.269        1
3.225        1
2.988        1
3.954        1
3.265        1
4.236        1
3.641        1
4.370        1
5.562        1
3.685        1
3.899        1
4.001        1
3.872        1
3.713        1
5.855        1
4.203        1
5.147        1
4.181        1
3.654        1
3.736        1
4.672        1
4.169        1
3.661        1
6.480        1
3.682        1
5.407        1
3.128        1
4.012        1
4.730        1
2.995        1
3.772        1
4.172        1
3.951        1
4.113        1
4.011        1
4.086        1
4.180        1
4.667        1
4.231        1
3.666        1
4.131        1
2.862        1
3.600        1
5.497        1
4.219        1
3.606        1
4.759        1
3.833        1
5.742        1
5.060        1
0.951        1
4.216        1
3.829        1
3.944        1
4.389        1
3.624        1
5.381        1
4.680        1
4.271        1
3.612        1
3.298        1
4.088        1
3.634        1
4.319        1
3.916        1
2.787        1
4.695        1
5.241        1
5.123        1
5.299        1
4.145        1
4.150        1
4.757        1
3.625        1
3.828        1
4.970        1
5.135        1
3.678        1
3.699        1
4.241        1
4.069        1
4.280        1
4.068        1
6.129        1
4.383        1
4.506        1
4.207        1
2.061        1
3.986        1
4.307        1
5.008        1
4.134        1
2.990        1
3.997        1
3.858        1
3.814        1
5.293        1
3.240        1
4.936        1
4.395        1
3.870        1
3.873        1
4.892        1
5.317        1
3.809        1
4.802        1
5.225        1
4.238        1
4.178        1
1.959        1
3.773        1
3.961        1
6.492        1
4.129        1
5.565        1
4.574        1
4.092        1
2.824        1
4.318        1
5.030        1
5.066        1
4.320        1
4.079        1
4.135        1
4.938        1
3.962        1
4.721        1
4.042        1
3.889        1
5.129        1
6.788        1
4.195        1
4.869        1
2.951        1
4.247        1
3.646        1
4.185        1
3.897        1
3.868        1
4.026        1
4.097        1
3.993        1
3.807        1
2.895        1
3.816        1
10.754       1
7.313        1
Name: count, dtype: int64
--------------------------------------------------
bathroom
1.000    4873
2.000    2742
3.000     421
4.000     121
5.000      41
1.094      36
0.986      35
...        (1006 more rows of non-integer 'bathroom' values omitted; each appears 1–36 times)
3.053       1
2.828       1
Name: count, dtype: int64
  • The variable 'rooms' contains a large share of non-integer values and will require feature engineering
  • The variable 'bathroom' contains a large share of non-integer values and will require feature engineering
In [18]:
room_counts_list = []
# Iterate through each integer value of rooms
for i in range(1, 1 + int(df['rooms'].max())):  # Series.max() skips NaN, unlike builtin max()
    count = (df['rooms'] == i).sum()  # Count occurrences of the current integer value
    room_counts_list.append({'rooms': i, 'count': count})  # Add result to the list

# Convert the list of dictionaries into a DataFrame
room_counts = pd.DataFrame(room_counts_list)

# Calculate totals
int_rooms = room_counts['count'].sum()

room_counts['int_prop'] = room_counts['count'] / int_rooms
room_counts['net_prop'] = room_counts['count'] / len(df)  # len(df) == 16376 total observations
room_counts
Out[18]:
rooms count int_prop net_prop
0 1 1600 0.200000 0.097704
1 2 2608 0.326000 0.159257
2 3 2461 0.307625 0.150281
3 4 1061 0.132625 0.064790
4 5 232 0.029000 0.014167
5 6 28 0.003500 0.001710
6 7 5 0.000625 0.000305
7 8 0 0.000000 0.000000
8 9 2 0.000250 0.000122
9 10 3 0.000375 0.000183
In [19]:
print(f"The total number of observations with an integer value for variable 'rooms' is {room_counts['count'].sum()}, which represents {room_counts['net_prop'].sum()*100:.2f}% of total observations")
The total number of observations with an integer value for variable 'rooms' is 8000, which represents 48.85% of total observations
In [20]:
bathroom_counts_list = []
# Iterate through each integer value of bathroom
for i in range(1, 1 + int(df['bathroom'].max())):  # Series.max() skips NaN, unlike builtin max()
    count = (df['bathroom'] == i).sum()  # Count occurrences of the current integer value
    bathroom_counts_list.append({'bathroom': i, 'count': count})  # Add result to the list

# Convert the list of dictionaries into a DataFrame
bathroom_counts = pd.DataFrame(bathroom_counts_list)

# Calculate totals
int_bathroom = bathroom_counts['count'].sum()

bathroom_counts['int_prop'] = bathroom_counts['count'] / int_bathroom  # fixed: was mistakenly divided by int_rooms
bathroom_counts['net_prop'] = bathroom_counts['count'] / len(df)
bathroom_counts
Out[20]:
bathroom count int_prop net_prop
0 1 4873 0.593545 0.297570
1 2 2742 0.333983 0.167440
2 3 421 0.051279 0.025708
3 4 121 0.014738 0.007389
4 5 41 0.004994 0.002504
5 6 9 0.001096 0.000550
6 7 1 0.000122 0.000061
7 8 2 0.000244 0.000122
In [21]:
print(f"The total number of observations with an integer value for variable 'bathroom' is {bathroom_counts['count'].sum()}, which represents {bathroom_counts['net_prop'].sum()*100:.2f}% of total observations")
The total number of observations with an integer value for variable 'bathroom' is 8210, which represents 50.13% of total observations
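The loop-based tallies above can also be cross-checked with a single vectorized expression. A minimal sketch on a toy series (the values here are hypothetical, not drawn from the dataset):

```python
import pandas as pd
import numpy as np

# Toy series mimicking 'rooms': two clean integers, two augmented
# decimals, and one missing value (all values are hypothetical)
rooms = pd.Series([2.0, 3.0, 2.185, 4.296, np.nan])

# Fraction of ALL observations holding a whole number; NaN % 1 is NaN,
# and NaN == 0 evaluates to False, so missing values count as non-integer
int_share = (rooms % 1 == 0).mean()
print(f'{int_share:.2%}')  # 2 of 5 values are integers -> 40.00%
```

Applied to `df['rooms']` or `df['bathroom']`, this one-liner reproduces the `net_prop` totals without building the intermediate count DataFrames.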
  • Given the high proportion of non-integer (invalid) values in 'rooms' and 'bathroom' (51.15% and 49.87% respectively), and since the Project 2 dataset is stated to be an augmented version of the Project 1 dataset, we interpret that the Project 1 dataframe was enlarged with artificial observations, and that during this data augmentation the synthetic decimal values were never rounded back to integers in the Project 2 dataset.
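Since the decimals appear to be augmentation noise rather than real fractional counts, one possible feature-engineering fix is to round them back to whole numbers. A minimal sketch on a hypothetical mini-frame (`df_demo` and its values are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical mini-frame standing in for df2 (values are illustrative)
df_demo = pd.DataFrame({'rooms': [2.185, 3.0, 4.296, 1.028],
                        'bathroom': [1.094, 2.0, 0.986, 3.249]})

# Round the augmented decimals back to whole counts; the nullable
# Int64 dtype keeps any NaNs intact instead of raising on the cast
for col in ['rooms', 'bathroom']:
    df_demo[col] = df_demo[col].round().astype('Int64')

print(df_demo['rooms'].tolist())     # [2, 3, 4, 1]
print(df_demo['bathroom'].tolist())  # [1, 2, 1, 3]
```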
In [22]:
# Duplicates
print(df.duplicated().sum()) # Checking for duplicate entries in the data
0
  • There are no duplicated observations

Missing Value Handling¶

In [23]:
df2=df.copy()
In [24]:
null_counts = df2.isnull().sum()
null_percentage = (null_counts / len(df2)) * 100
null_summary = pd.DataFrame({'Null Count': null_counts,'Null Percentage': null_percentage.round(2)})
null_summary
Out[24]:
Null Count Null Percentage
Unnamed: 0 0 0.00
price 0 0.00
rooms 410 2.50
bathroom 387 2.36
lift 0 0.00
terrace 0 0.00
square_meters 408 2.49
real_state 918 5.61
neighborhood 0 0.00
square_meters_price 439 2.68
In [25]:
# Create a new dataframe with rows that contain at least one missing value
df_missing = df[df.isnull().any(axis=1)]

# Reset index for better readability (optional)
df_missing = df_missing.reset_index(drop=True)
In [26]:
df_missing.shape
Out[26]:
(2311, 10)
In [27]:
df_missing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2311 entries, 0 to 2310
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           2311 non-null   int64  
 1   price                2311 non-null   int64  
 2   rooms                1901 non-null   float64
 3   bathroom             1924 non-null   float64
 4   lift                 2311 non-null   bool   
 5   terrace              2311 non-null   bool   
 6   square_meters        1903 non-null   float64
 7   real_state           1393 non-null   object 
 8   neighborhood         2311 non-null   object 
 9   square_meters_price  1872 non-null   float64
dtypes: bool(2), float64(4), int64(2), object(2)
memory usage: 149.1+ KB
In [28]:
mask1 = df2["square_meters"].isna() & df2["price"].notna() & df2["square_meters_price"].notna()
df2.loc[mask1, "square_meters"] = df2["price"] / df2["square_meters_price"]
df2.isnull().sum() # Checking for missing values in the data
Out[28]:
Unnamed: 0               0
price                    0
rooms                  410
bathroom               387
lift                     0
terrace                  0
square_meters           19
real_state             918
neighborhood             0
square_meters_price    439
dtype: int64
  • 389 of the 408 missing "square_meters" values are imputed using the relation "price" / "square_meters_price"
In [29]:
mask2 = df2["square_meters"].notna() & df2["price"].notna() & df2["square_meters_price"].isna()
df2.loc[mask2, "square_meters_price"] = df2["price"] / df2["square_meters"]
df2.isnull().sum() # Checking for missing values in the data
Out[29]:
Unnamed: 0               0
price                    0
rooms                  410
bathroom               387
lift                     0
terrace                  0
square_meters           19
real_state             918
neighborhood             0
square_meters_price     19
dtype: int64
  • 420 of the 439 missing "square_meters_price" values are imputed using the relation "price" / "square_meters"
In [30]:
df2[(df2['square_meters_price'].isnull())&(df2['square_meters'].isnull())]
Out[30]:
Unnamed: 0 price rooms bathroom lift terrace square_meters real_state neighborhood square_meters_price
8748 8748 1300 4.392 1.980 True False NaN flat Eixample NaN
8784 8784 850 0.950 0.995 False False NaN flat Sants-Montjuïc NaN
9118 9118 925 2.175 0.924 True False NaN flat Gràcia NaN
9321 9321 895 NaN 1.877 True False NaN flat Sants-Montjuïc NaN
9442 9442 800 0.924 1.092 False False NaN flat Horta- Guinardo NaN
9519 9519 995 2.858 2.161 False True NaN flat Eixample NaN
10167 10167 1218 3.686 2.140 True False NaN flat Sarria-Sant Gervasi NaN
11180 11180 600 NaN 0.917 True True NaN flat Sants-Montjuïc NaN
11496 11496 945 1.009 0.962 True True NaN attic Sarria-Sant Gervasi NaN
11959 11959 850 3.039 NaN True False NaN flat Horta- Guinardo NaN
12782 12782 1000 3.226 0.965 False False NaN flat Eixample NaN
13086 13086 790 3.637 0.963 True True NaN flat Eixample NaN
13189 13189 5300 NaN 4.695 False True NaN flat Les Corts NaN
13401 13401 850 0.973 0.917 True False NaN apartment Eixample NaN
13817 13817 1100 2.937 2.169 False False NaN flat Gràcia NaN
14693 14693 740 1.068 0.949 True False NaN flat Eixample NaN
15761 15761 1140 2.179 0.942 True False NaN attic Gràcia NaN
16118 16118 1500 3.255 1.986 True False NaN flat Sarria-Sant Gervasi NaN
16181 16181 800 1.050 1.082 True False NaN flat Gràcia NaN
  • There are 19 properties with missing values in both "square_meters" and "square_meters_price"
In [31]:
df2['square_meters_price'] = df2['square_meters_price'].fillna(df2.groupby(['real_state', 'neighborhood'])['square_meters_price'].transform('mean'))
df2.isnull().sum() # Checking for missing values in the data
Out[31]:
Unnamed: 0               0
price                    0
rooms                  410
bathroom               387
lift                     0
terrace                  0
square_meters           19
real_state             918
neighborhood             0
square_meters_price      0
dtype: int64
  • The remaining 19 missing "square_meters_price" values are imputed with the group mean based on "real_state" and "neighborhood".
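The group-mean transform can still leave gaps when an entire ("real_state", "neighborhood") group is missing. A minimal sketch on toy data (illustrative values, not the project dataset) of a two-stage fallback: group mean first, overall mean second:

```python
import pandas as pd

toy = pd.DataFrame({
    "real_state": ["flat", "flat", "attic", "attic"],
    "neighborhood": ["Gràcia", "Gràcia", "Eixample", "Eixample"],
    "square_meters_price": [10.0, None, None, None],  # whole attic/Eixample group is NaN
})

# First pass: group mean (leaves the all-NaN group untouched)
group_mean = toy.groupby(["real_state", "neighborhood"])["square_meters_price"].transform("mean")
toy["square_meters_price"] = toy["square_meters_price"].fillna(group_mean)

# Second pass: overall mean as a fallback for groups that were entirely missing
toy["square_meters_price"] = toy["square_meters_price"].fillna(toy["square_meters_price"].mean())
print(toy["square_meters_price"].tolist())  # [10.0, 10.0, 10.0, 10.0]
```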
In [32]:
df2.loc[df2['square_meters'].isna(), 'square_meters'] = df2['price'] / df2['square_meters_price']
df2.isnull().sum() # Checking for missing values in the data
Out[32]:
Unnamed: 0               0
price                    0
rooms                  410
bathroom               387
lift                     0
terrace                  0
square_meters            0
real_state             918
neighborhood             0
square_meters_price      0
dtype: int64
  • The remaining 19 missing "square_meters" values are imputed using the relation "price" / "square_meters_price"
In [33]:
# Compute the most common (mode) real_state for each neighborhood
mode_real_state = df2.groupby("neighborhood")["real_state"].apply(lambda x: x.mode()[0] if not x.mode().empty else np.nan)

# Fill missing values in real_state based on the mode of each neighborhood
df2["real_state"] = df2["real_state"].fillna(df2["neighborhood"].map(mode_real_state))

df2.isnull().sum() # Checking for missing values in the data
Out[33]:
Unnamed: 0               0
price                    0
rooms                  410
bathroom               387
lift                     0
terrace                  0
square_meters            0
real_state               0
neighborhood             0
square_meters_price      0
dtype: int64
  • Imputed missing "real_state" values by filling them with the most common (mode) "real_state" for each "neighborhood".
In [34]:
#df2['rooms'] = df2['rooms'].fillna(df2.groupby(['real_state', 'neighborhood'])['rooms'].transform('mean'))
df2['rooms'] = df2['rooms'].fillna(df2.groupby(['real_state', 'neighborhood'])['rooms'].transform('median'))
df2.isnull().sum() # Checking for missing values in the data
Out[34]:
Unnamed: 0               0
price                    0
rooms                    0
bathroom               387
lift                     0
terrace                  0
square_meters            0
real_state               0
neighborhood             0
square_meters_price      0
dtype: int64
  • 410 missing "rooms" values are imputed with the group median based on "real_state" and "neighborhood".
In [35]:
#df2['bathroom'] = df2['bathroom'].fillna(df2.groupby(['real_state', 'neighborhood'])['bathroom'].transform('mean'))
df2['bathroom'] = df2['bathroom'].fillna(df2.groupby(['real_state', 'neighborhood'])['bathroom'].transform('median'))
df2.isnull().sum() # Checking for missing values in the data
Out[35]:
Unnamed: 0             0
price                  0
rooms                  0
bathroom               0
lift                   0
terrace                0
square_meters          0
real_state             0
neighborhood           0
square_meters_price    0
dtype: int64
  • 387 missing "bathroom" values are imputed with the group median based on "real_state" and "neighborhood".

Feature engineering¶

In [36]:
df3=df2.copy()
In [37]:
df3=df3.drop(['Unnamed: 0'],axis=1)
df3.head()
Out[37]:
price rooms bathroom lift terrace square_meters real_state neighborhood square_meters_price
0 750 3.0 1.0 True False 60.0 flat Horta- Guinardo 12.500000
1 770 2.0 1.0 True False 59.0 flat Sant Andreu 13.050847
2 1300 1.0 1.0 True True 30.0 flat Gràcia 43.333333
3 2800 1.0 1.0 True True 70.0 flat Ciutat Vella 40.000000
4 720 2.0 1.0 True False 44.0 flat Sant Andreu 16.363636
  • Removed the variable "Unnamed: 0" which had no value for modeling
In [38]:
df3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16376 entries, 0 to 16375
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   price                16376 non-null  int64  
 1   rooms                16376 non-null  float64
 2   bathroom             16376 non-null  float64
 3   lift                 16376 non-null  bool   
 4   terrace              16376 non-null  bool   
 5   square_meters        16376 non-null  float64
 6   real_state           16376 non-null  object 
 7   neighborhood         16376 non-null  object 
 8   square_meters_price  16376 non-null  float64
dtypes: bool(2), float64(4), int64(1), object(2)
memory usage: 927.7+ KB
In [39]:
# Keep a snapshot of the data before rounding 'rooms' and 'bathroom'.
# Note: both columns are already float64 dtype, so this mask matches every
# row with a value in either column; a mask for strictly non-integer values
# would be (df3["rooms"] % 1 != 0) | (df3["bathroom"] % 1 != 0)
df3_float = df3[df3[["rooms", "bathroom"]].select_dtypes(include=["float64"]).notna().any(axis=1)]
In [40]:
df3_float.shape
Out[40]:
(16376, 9)
In [41]:
df3['rooms'] = df3['rooms'].apply(lambda x: 1 if x < 1 else round(x)).astype(int)
df3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16376 entries, 0 to 16375
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   price                16376 non-null  int64  
 1   rooms                16376 non-null  int64  
 2   bathroom             16376 non-null  float64
 3   lift                 16376 non-null  bool   
 4   terrace              16376 non-null  bool   
 5   square_meters        16376 non-null  float64
 6   real_state           16376 non-null  object 
 7   neighborhood         16376 non-null  object 
 8   square_meters_price  16376 non-null  float64
dtypes: bool(2), float64(3), int64(2), object(2)
memory usage: 927.7+ KB
In [42]:
df3['bathroom'] = df3['bathroom'].apply(lambda x: 1 if x < 1 else round(x)).astype(int)
df3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16376 entries, 0 to 16375
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   price                16376 non-null  int64  
 1   rooms                16376 non-null  int64  
 2   bathroom             16376 non-null  int64  
 3   lift                 16376 non-null  bool   
 4   terrace              16376 non-null  bool   
 5   square_meters        16376 non-null  float64
 6   real_state           16376 non-null  object 
 7   neighborhood         16376 non-null  object 
 8   square_meters_price  16376 non-null  float64
dtypes: bool(2), float64(2), int64(3), object(2)
memory usage: 927.7+ KB
  • Transformed the values of "rooms" and "bathroom" into integers using the following logic:
    • Values under 1 → Set to 1
    • Values 1 or above → Round to the nearest integer
  • Variables "rooms" and "bathroom" set as integer
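The rounding rule above can be packaged as a small vectorized helper. A sketch on toy values (not the project data); note that np.round, like Python's built-in round() used in the cells above, rounds exact halves to the nearest even integer, so 2.5 becomes 2:

```python
import numpy as np
import pandas as pd

def to_int_count(s: pd.Series) -> pd.Series:
    """Raise values below 1 up to 1, then round to the nearest integer.
    np.round (like Python's round) rounds halves to even, so 2.5 -> 2."""
    return np.round(s.clip(lower=1)).astype(int)

rooms = pd.Series([0.3, 0.95, 2.175, 3.686, 2.5])
print(to_int_count(rooms).tolist())  # [1, 1, 2, 4, 2]
```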
In [43]:
df3.describe(include="all").T # statistical summary of the data.
Out[43]:
count unique top freq mean std min 25% 50% 75% max
price 16376.0 NaN NaN NaN 1437.04586 1106.831419 320.0 875.0 1100.0 1514.0 15000.0
rooms 16376.0 NaN NaN NaN 2.447545 1.078844 1.0 2.0 2.0 3.0 11.0
bathroom 16376.0 NaN NaN NaN 1.495237 0.714843 1.0 1.0 1.0 2.0 8.0
lift 16376 2 True 11246 NaN NaN NaN NaN NaN NaN NaN
terrace 16376 2 False 12770 NaN NaN NaN NaN NaN NaN NaN
square_meters 16376.0 NaN NaN NaN 84.357363 47.454864 10.0 56.048 72.689 95.0 679.0
real_state 16376 4 flat 13568 NaN NaN NaN NaN NaN NaN NaN
neighborhood 16376 10 Eixample 4795 NaN NaN NaN NaN NaN NaN NaN
square_meters_price 16376.0 NaN NaN NaN 17.727253 9.185362 4.549 12.773723 15.315158 19.389167 197.272

Outliers detection and treatment¶

In [44]:
# function to check for outliers using the 1.5*IQR (Tukey) rule
def count_outliers(df):
    outlier_count = 0
    for column in df.select_dtypes(include=np.number).columns:
        q1, q3 = df[column].quantile(0.25), df[column].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outliers = ((df[column] < lower) | (df[column] > upper)).sum()
        print(f'{column}: {outliers} outliers ({outliers/df.shape[0]*100:.2f}%)')
        outlier_count += outliers
    return outlier_count
In [45]:
df4=df3.copy()
In [46]:
count_outliers(df)
Unnamed: 0: 0 outliers (0.00%)
price: 1778 outliers (10.86%)
rooms: 870 outliers (5.31%)
bathroom: 308 outliers (1.88%)
square_meters: 1177 outliers (7.19%)
square_meters_price: 1165 outliers (7.11%)
Out[46]:
5298
In [47]:
df.shape
Out[47]:
(16376, 10)
In [48]:
count_outliers(df4)
price: 1778 outliers (10.86%)
rooms: 505 outliers (3.08%)
bathroom: 308 outliers (1.88%)
square_meters: 1206 outliers (7.36%)
square_meters_price: 1201 outliers (7.33%)
Out[48]:
4998
In [49]:
df4.shape
Out[49]:
(16376, 9)
In [50]:
# Z-Score Method: drop rows where any numerical variable lies more than 2 standard deviations from its mean
numeric = df4.select_dtypes(include=np.number)
df5 = df4[(np.abs(numeric.apply(zscore)) < 2).all(axis=1)]
count_outliers(df5)
price: 960 outliers (6.73%)
rooms: 0 outliers (0.00%)
bathroom: 0 outliers (0.00%)
square_meters: 499 outliers (3.50%)
square_meters_price: 593 outliers (4.16%)
Out[50]:
2052
In [51]:
df5.shape
Out[51]:
(14269, 9)
  • Applied the Z-score method, removing observations that lie more than 2 standard deviations from the mean in any numerical variable.
  • Some variables still retain a relevant percentage of IQR outliers. df5 shape: (14269, 9)
In [52]:
df6 = df5.copy()
# Cap (winsorize) each numerical variable at the Tukey whiskers
for column in df6.select_dtypes(include=np.number).columns:
    q1, q3 = df6[column].quantile(0.25), df6[column].quantile(0.75)
    iqr = q3 - q1
    df6[column] = np.clip(df6[column], q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df6.shape
Out[52]:
(14269, 9)
In [53]:
count_outliers(df6)
price: 0 outliers (0.00%)
rooms: 0 outliers (0.00%)
bathroom: 0 outliers (0.00%)
square_meters: 0 outliers (0.00%)
square_meters_price: 0 outliers (0.00%)
Out[53]:
0
  • Capping outliers at the IQR whiskers (winsorization) is chosen given the nature of the data
  • Winsorization can hide genuine trends in luxury or budget properties; here, however, the extreme prices are assumed to be errors or anomalies introduced by the synthetic/augmented data, so capping them makes the model more robust to those outliers.
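As an alternative to the manual np.clip loop, scipy offers percentile-based winsorization. A hedged sketch on toy prices (scipy.stats.mstats.winsorize is not what this notebook uses; it caps by percentile rank rather than at the IQR whiskers):

```python
import numpy as np
from scipy.stats.mstats import winsorize

prices = np.array([320, 850, 1000, 1300, 15000])  # one extreme value

# Cap the top 20% of values (here, the single highest) at the next-largest value
capped = np.asarray(winsorize(prices, limits=(0, 0.2)))
print(capped.max())
```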
In [54]:
df6.info()
<class 'pandas.core.frame.DataFrame'>
Index: 14269 entries, 0 to 16375
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   price                14269 non-null  int64  
 1   rooms                14269 non-null  int64  
 2   bathroom             14269 non-null  int64  
 3   lift                 14269 non-null  bool   
 4   terrace              14269 non-null  bool   
 5   square_meters        14269 non-null  float64
 6   real_state           14269 non-null  object 
 7   neighborhood         14269 non-null  object 
 8   square_meters_price  14269 non-null  float64
dtypes: bool(2), float64(2), int64(3), object(2)
memory usage: 919.7+ KB
In [55]:
df6.describe(include="all").T # statistical summary of the data.
Out[55]:
count unique top freq mean std min 25% 50% 75% max
price 14269.0 NaN NaN NaN 1124.019833 371.448532 320.0 850.0 1000.0 1300.0 1975.0
rooms 14269.0 NaN NaN NaN 2.308291 0.939464 1.0 2.0 2.0 3.0 4.0
bathroom 14269.0 NaN NaN NaN 1.340949 0.474045 1.0 1.0 1.0 2.0 2.0
lift 14269 2 True 9753 NaN NaN NaN NaN NaN NaN NaN
terrace 14269 2 False 11384 NaN NaN NaN NaN NaN NaN NaN
square_meters 14269.0 NaN NaN NaN 73.41199 25.366439 10.313 55.0 70.0 87.264 135.66
real_state 14269 4 flat 12201 NaN NaN NaN NaN NaN NaN NaN
neighborhood 14269 10 Eixample 4154 NaN NaN NaN NaN NaN NaN NaN
square_meters_price 14269.0 NaN NaN NaN 16.1397 4.684702 6.001 12.676 15.0 18.681319 27.689297

Data Management¶

In [57]:
df.to_csv('df_ORIGINAL_DATA.csv', index=False)  # Save a copy of original data
In [58]:
df_missing.to_csv('df_MISSING_DATA.csv', index=False)  # Save a copy of missing data to be imputed
In [59]:
df2.to_csv('df2_IMPUTED_DATA.csv', index=False)  # Save a copy of data after imputation of missing values
In [60]:
df3_float.to_csv('df3_WRONG FEATURES_DATA.csv', index=False)  # Save a copy of data before feature engineering
In [61]:
df3.to_csv('df3_FEATURE ENGINEERED_DATA.csv', index=False)  # Save a copy of data after feature engineering
In [62]:
df6.to_csv('df6_WITHOUT OUTLIERS_DATA.csv', index=False)  # Save a copy of data after outliers handling
  • 'df_ORIGINAL_DATA.csv': Reference copy of the original data.
  • 'df_MISSING_DATA.csv': Subset of rows containing at least one missing value.
  • 'df2_IMPUTED_DATA.csv': Dataset after imputation of missing values.
  • 'df3_WRONG FEATURES_DATA.csv': Subset of the data subject to feature engineering, before transformation.
  • 'df3_FEATURE ENGINEERED_DATA.csv': Dataset after feature engineering.
  • 'df6_WITHOUT OUTLIERS_DATA.csv': Dataset after outlier handling.
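Saving with index=False matters here: writing the index and then reading the file back is exactly what produces an "Unnamed: 0" column like the one dropped earlier. A minimal round-trip sketch on toy data (using an in-memory buffer rather than a file):

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"price": [750, 770], "lift": [True, False]})

# With index=False the index is not written, so no "Unnamed: 0" appears on reload
buf = io.StringIO()
df_demo.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

print(reloaded.columns.tolist())   # ['price', 'lift']
print(reloaded["lift"].dtype)      # bool: pandas parses True/False literals back to bool
```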

Data Preparation Consolidated Notes¶

Data Overview

  • The variable 'Unnamed: 0' represents the index and should be deleted from the data
  • Target variable for modeling is "price"
  • There are 16376 rows and 10 columns.
  • Project1 data had 8188 rows and 10 columns.
  • Data types match the data description, except that 'rooms' and 'bathroom' are float where integers are expected
  • There are missing data (NaN) on multiple variables
  • Unit sizes range from 10m2 to 679m2, with a mean of 84.36m2
  • Unit prices range from 320EUR to 15000EUR per month, with a mean of 1437EUR per month
  • The price range is assumed to refer to monthly rent, so prices are treated as EUR per month
  • Prices per square meter range from 4.549EUR/m2/month to 197.272EUR/m2/month, with a mean of 17.73EUR/m2/month
  • There are units listed with zero rooms and with 10.754 rooms
  • There are units listed with 0.9 bathrooms
  • There are four types of real estate, the most common being "flat"
  • Most units do not have a terrace
  • Most units have a lift
  • The neighborhood with the largest unit count is "Eixample"
  • The variable 'rooms' will require feature engineering
  • The variable 'bathroom' will require feature engineering
  • 8000 observations (48.85% of the total) have an integer value for "rooms"
  • 8210 observations (50.13% of the total) have an integer value for "bathroom"
  • Given the high proportion of non-integer values in 'rooms' and 'bathroom' (51.15% and 49.87%), and given that the Project2 dataset is stated to be an augmented version of the Project1 dataset, the interpretation is that the Project1 data was enlarged with artificial observations, and that during this Data Augmentation the observations with decimal values were not corrected back to integers in the Project2 dataset.
  • There are no duplicated observations
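The integer/non-integer split quoted above can be computed with a modulo mask. A sketch on toy values (illustrative, not the project dataset):

```python
import pandas as pd

rooms = pd.Series([3.0, 2.0, 4.392, 0.950, 2.175, None])

# True where the non-missing value has no decimal part
is_integer = rooms.dropna() % 1 == 0
print(int(is_integer.sum()))                  # 2 integer-valued observations
print(round(is_integer.mean() * 100, 2))      # 40.0 (% integer among non-missing)
```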

Missing Value handling

  • 389 of the 408 missing "square_meters" values are imputed using the relation "price" / "square_meters_price"
  • 420 of the 439 missing "square_meters_price" values are imputed using the relation "price" / "square_meters"
  • There are 19 properties with missing values in both "square_meters" and "square_meters_price"
  • The remaining 19 missing "square_meters_price" values are imputed with the group mean based on "real_state" and "neighborhood".
  • The remaining 19 missing "square_meters" values are imputed using the relation "price" / "square_meters_price"
  • Missing "real_state" values are imputed with the most common (mode) "real_state" of each "neighborhood".
  • 410 missing "rooms" values are imputed with the group median based on "real_state" and "neighborhood".
  • 387 missing "bathroom" values are imputed with the group median based on "real_state" and "neighborhood".

Feature engineering

  • Removed the variable "Unnamed: 0" which had no value for modeling
  • Transformed the values of "rooms" and "bathroom" into integers using the following logic:
    • Values under 1 → Set to 1
    • Values 1 or above → Round to the nearest integer
  • Variables "rooms" and "bathroom" set as integer

Outliers detection and treatment

  • Applied the Z-score method, removing observations that lie more than 2 standard deviations from the mean in any numerical variable.
  • Some variables still retain a relevant percentage of IQR outliers. df5 shape: (14269, 9)
  • Capping outliers at the IQR whiskers (winsorization) is chosen given the nature of the data
  • Winsorization can hide genuine trends in luxury or budget properties; here, however, the extreme prices are assumed to be errors or anomalies introduced by the synthetic/augmented data, so capping them makes the model more robust to those outliers.

Data Management

  • 'df_ORIGINAL_DATA.csv': Reference copy of the original data.
  • 'df_MISSING_DATA.csv': Subset of rows containing at least one missing value.
  • 'df2_IMPUTED_DATA.csv': Dataset after imputation of missing values.
  • 'df3_WRONG FEATURES_DATA.csv': Subset of the data subject to feature engineering, before transformation.
  • 'df3_FEATURE ENGINEERED_DATA.csv': Dataset after feature engineering.
  • 'df6_WITHOUT OUTLIERS_DATA.csv': Dataset after outlier handling.

4. Exploratory Data Analysis (EDA)¶

Analyzing the data to understand patterns, relationships, and potential anomalies. This step often involves data visualization and statistical analysis to generate insights.

EDA Functions¶

In [63]:
def univariate_numerical(data):
    '''
    Function to generate two plots for each numerical variable
    Histplot for variable distribution
    Boxplot for statistical summary 
    '''
    # Select numerical columns
    numerical_cols = data.select_dtypes(include=[np.number]).columns
    
    # Determine the number of rows and columns
    num_vars = len(numerical_cols)
    num_cols = 4
    num_rows = int(np.ceil(num_vars * 2 / num_cols))
    
    # Create a figure with the specified size
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5))
    
    # Flatten the axes array for easy iteration
    axes = axes.flatten()
    
    # Plot each variable with a histplot and a boxplot
    for i, col in enumerate(numerical_cols):
        mean_value = data[col].mean()
        
        # Histplot with KDE
        sns.histplot(data[col], kde=True, ax=axes[i*2])
        axes[i*2].axvline(mean_value, color='r', linestyle='--')
        axes[i*2].set_title(f'Distribution of {col}')
        axes[i*2].text(mean_value, axes[i*2].get_ylim()[1]*0.8, f'Mean: {mean_value:.2f}', color='r', va='baseline', ha='left',rotation=90)
        
        # Boxplot
        sns.boxplot(y=data[col], ax=axes[i*2 + 1])
        axes[i*2 + 1].axhline(mean_value, color='r', linestyle='--')
        axes[i*2 + 1].set_title(f'Boxplot of {col}')
        axes[i*2 + 1].text(axes[i*2 + 1].get_xlim()[1]*0.8, mean_value, f'mean: {mean_value:.2f}', color='r', va='baseline', ha='right')
    
    # Hide any remaining empty subplots
    for j in range(num_vars * 2, len(axes)):
        fig.delaxes(axes[j])
    
    # Adjust layout
    plt.tight_layout()
    plt.show()
In [64]:
def univariate_categorical(data):
    '''
    Function to generate countplot for each categorical variable
    Labeled with count and percentage
    '''
    # List of categorical columns
    categorical_columns = data.select_dtypes(include=['object', 'category']).columns.tolist()
    
    # Number of columns in the grid
    num_cols = 4
    
    # Calculate the number of rows needed
    num_rows = (len(categorical_columns) + num_cols - 1) // num_cols
    
    # Create the grid
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5), constrained_layout=True)
    axes = axes.flatten()
    
    # Plot each countplot in the grid
    for i, col in enumerate(categorical_columns):
        ax = axes[i]
        plot = sns.countplot(x=col, data=data, order=data[col].value_counts().index, ax=ax)
        ax.set_title(f'Count of {col}')
           
        # Add total count and percentage annotations
        total = len(data)
        for p in plot.patches:
            height = p.get_height()
            percentage = f'{(height / total * 100):.1f}%'
            plot.text(x=p.get_x() + p.get_width() / 2,
                      y=height + 2,
                      s=f'{height:.0f}\n({percentage})',
                      ha='center')
        
        # Limit x-axis labels to avoid overlap
        ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
    
    # Remove any empty subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])
    
    # Show the plot
    plt.show()
In [65]:
# Function to plot crosstab with labels
def plot_crosstab_bar_count(df, var_interest):
    '''
    Function to create a barplot of crosstab of the variable of interest vs each of the rest of categorical variables
    Labeled with counts
    '''
    # Extract categorical columns excluding the variable of interest
    cat_cols = df.select_dtypes(include=['category', 'object','bool']).columns.tolist()
    cat_cols.remove(var_interest)
    
    # Determine the grid size
    num_vars = len(cat_cols)
    num_cols = 3  # Number of columns in the grid
    num_rows = (num_vars // num_cols) + int(num_vars % num_cols > 0)

    # Create a grid of subplots
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5), constrained_layout=True)
    axes = axes.flatten()  # Flatten the axes array for easy iteration

    for i, col in enumerate(cat_cols):
        # Create a crosstab
        crosstab = pd.crosstab(df[col], df[var_interest])
        
        # Plot the crosstab as a bar plot
        crosstab.plot(kind='bar', stacked=True, ax=axes[i])
        
        # Annotate counts in the middle of each bar section
        for bar in axes[i].patches:
            height = bar.get_height()
            if height > 0:
                axes[i].annotate(f'{int(height)}', 
                                 (bar.get_x() + bar.get_width() / 2, bar.get_y() + height / 2),
                                 ha='center', va='center', fontsize=10, color='black')
        
        # Add total labels at the top of each bar
        totals = crosstab.sum(axis=1)
        for j, total in enumerate(totals):
            axes[i].annotate(f'Total: {total}', 
                             (j, totals[j]), 
                             ha='center', va='bottom', weight='bold')

    # Hide any remaining empty subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    plt.tight_layout()
    plt.show()

# Usage
#plot_crosstab_bar_count(df, var_interest='var_interest')
In [66]:
def plot_crosstab_heat_perc(df, var_interest,df_name="DataFrame"):
    '''
    Function to create a heatmap of crosstab of the variable of interest vs each of the rest of categorical variables
    Labeled with counts, percentage by row, percentage by column
    '''
    # Extract categorical columns excluding the variable of interest
    cat_cols = df.select_dtypes(include=['category', 'object']).columns.tolist()
    cat_cols.remove(var_interest)
    
    # Determine the grid size
    num_vars = len(cat_cols)
    num_cols = 3  # Number of columns in the grid
    num_rows = (num_vars // num_cols) + int(num_vars % num_cols > 0)

    # Create a grid of subplots
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(6*num_cols, num_rows * 6))
    axes = axes.flatten()  # Flatten the axes array for easy iteration

    for i, col in enumerate(cat_cols):
        # Create crosstabs
        crosstab = pd.crosstab(df[col], df[var_interest])
        crosstab_perc_row = crosstab.div(crosstab.sum(axis=1), axis=0) * 100
        crosstab_perc_col = crosstab.div(crosstab.sum(axis=0), axis=1) * 100

        # Combine counts with percentages
        crosstab_combined = crosstab.astype(str) + "\n" + \
                            crosstab_perc_row.round(2).astype(str) + "%" + "\n" + \
                            crosstab_perc_col.round(2).astype(str) + "%"

        # Plot the crosstab as a heatmap
        sns.heatmap(crosstab, annot=crosstab_combined, fmt='', cmap='Blues', ax=axes[i], cbar=False, annot_kws={"size": 8})
        axes[i].set_title(f'Crosstab of {col} and {var_interest} - {df_name}', fontsize=12)

    # Hide any remaining empty subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    # Adjust layout to prevent label overlapping
    plt.subplots_adjust(hspace=0.4, wspace=0.4)  # Add more space between subplots
    plt.tight_layout()
    plt.show()
    
# Usage
#plot_crosstab_heat_perc(df, var_interest='var_interest')
In [67]:
def boxplot_by_group(df, group, var, outliers, df_name="DataFrame"):
    '''
    boxplot for a numerical variable of interest vs a categorical variable
    with or without outliers
    includes data mean and mean by category
    '''
    # Calculate the average for the variable
    var_avg = df[var].mean()
    
    # Calculate variable mean per group
    var_means = df.groupby(group)[var].mean()
    
    # Sort by means and get the sorted order
    var_sorted = var_means.sort_values(ascending=False).index
    
    # Reorder the DataFrame by the sorted group
    df[group] = pd.Categorical(df[group], categories=var_sorted, ordered=True)
    
    # Create the boxplot with the reordered sectors
    ax = sns.boxplot(data=df, x=group, y=var, order=var_sorted, showfliers=outliers)
    
    # Add horizontal line for average variable value
    plt.axhline(var_avg, color='red', linestyle='--', label=f'Avg {var}: {var_avg:.2f}')
    
    # Scatter plot for means
    x_positions = range(len(var_means.sort_values(ascending=False)))
    plt.scatter(x=x_positions, y=var_means.sort_values(ascending=False), color='red', label='Mean', zorder=5)
    
    # Add labels to each red dot with the mean value
    for i, mean in enumerate(var_means.sort_values(ascending=False)):
        plt.text(i, mean, f'{mean:.2f}', color='red', ha='center', va='bottom')
    
    # Rotate x-axis labels
    plt.xticks(ticks=x_positions, labels=var_means.sort_values(ascending=False).index, rotation=90)
    
    # Add a legend
    plt.legend()
    plt.xlabel('')  # Remove x-axis title
    
    # Add plot title with DataFrame name
    plt.title(f'Boxplot of {var} by {group} - {df_name}')
    
    # Adjust layout
    plt.tight_layout()
    
    # Display the plot
    #plt.show()

    # Get the top 3 categories
    top_3_categories = var_means.sort_values(ascending=False).head(3).index.tolist()
    top_3=",".join(top_3_categories)
    # Print the top 3 categories
    print(f'Top 3 {group} by {var} mean value are: {top_3}')
In [68]:
# Define the function to create and display side-by-side boxplots
def side_by_side_boxplot(df1, df2, group, var, outliers, title1, title2):
    fig, axes = plt.subplots(1, 2, figsize=(18, 6), sharey=True)
    
    # First subplot for df1
    plt.sca(axes[0])
    boxplot_by_group(df1, group, var, outliers, title1)
    
    # Second subplot for df2
    plt.sca(axes[1])
    boxplot_by_group(df2, group, var, outliers, title2)
    
    # Show both plots after setup
    plt.show()

# Usage
#side_by_side_boxplot(df, df_pop, 'neighborhood', 'price', True, "All units (show outliers)", "Popular units (show outliers)")

Functions

  • univariate_numerical(data): Function to generate two plots for each numerical variable. Histplot for variable distribution. Boxplot for statistical summary
  • univariate_categorical(data): Function to generate countplot for each categorical variable. Labeled with count and percentage
  • plot_crosstab_bar_count(df, var_interest): Function to create a barplot of crosstab of the variable of interest vs each of the rest of categorical variables. Labeled with counts
  • plot_crosstab_heat_perc(df, var_interest): Function to create a heatmap of crosstab of the variable of interest vs each of the rest of categorical variables. Labeled with counts, percentage by row, percentage by column
  • boxplot_by_group(df, group, var, outliers): boxplot for a numerical variable of interest vs a categorical variable. with or without outliers. includes data mean and mean by category
  • side_by_side_boxplot(df1, df2, group, var, outliers, title1, title2): to present two side by side boxplot_by_group

Univariate Analysis¶

In [69]:
univariate_numerical(df)
[Figure: histogram with KDE and boxplot for each numerical variable in df]
In [70]:
univariate_numerical(df6)
[Figure: histogram with KDE and boxplot for each numerical variable in df6]
  • 'price', 'square_meters' and 'square_meters_price' are right skewed and reflect the effect of capping outliers at the upper whisker.
  • Comparing the original data (df) with the prepared data (df6): in the original data the numerical variables hold float values and many outliers, while in the prepared data 'rooms' and 'bathroom' are integers and the outliers have been removed or capped.
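Right skew can be quantified with the sample skewness (positive values indicate a long right tail). A sketch on toy prices, illustrative only:

```python
import pandas as pd

prices = pd.Series([850, 900, 1000, 1100, 1300, 1975, 15000])

# A long right tail gives positive skewness and pulls the mean above the median
print(prices.skew() > 0)                 # True
print(prices.mean() > prices.median())   # True
```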
In [71]:
univariate_categorical(df6)
[Figure: labeled countplot for each categorical variable in df6]
In [72]:
df6.loc[(df6['real_state']=="flat")].describe().T
Out[72]:
count mean std min 25% 50% 75% max
price 12201.0 1097.272601 344.695856 320.000 850.000 1000.000 1250.000000 1975.000000
rooms 12201.0 2.384067 0.930198 1.000 2.000 2.000 3.000000 4.000000
bathroom 12201.0 1.353004 0.477923 1.000 1.000 1.000 2.000000 2.000000
square_meters 12201.0 74.623287 24.862834 13.181 56.559 70.902 88.388000 135.660000
square_meters_price 12201.0 15.435230 4.183763 6.001 12.465 14.516 17.647059 27.689297
In [73]:
df.loc[(df['real_state']=="flat")].describe().T
Out[73]:
count mean std min 25% 50% 75% max
Unnamed: 0 12650.0 8038.174625 4750.342568 0.000000 3887.250000 7965.5000 12162.75000 16375.000
price 12650.0 1311.412490 917.152962 320.000000 865.000000 1050.0000 1352.00000 15000.000
rooms 12351.0 2.551887 1.091363 0.000000 2.000000 2.7380 3.03600 10.754
bathroom 12380.0 1.509471 0.715738 0.900000 1.000000 1.0405 2.00000 8.000
square_meters 12352.0 85.484011 45.657731 10.540000 59.000000 74.7985 95.64450 679.000
square_meters_price 12322.0 15.707694 5.333934 5.555556 12.437625 14.5000 17.67775 103.176
  • In the prepared data, flat units have at most 4 rooms and 135 m2 of area.
  • In the original data, there are flat units with 10.754 rooms and 679 m2 of area.
  • The "large flat" units in the data are assumed to be unreal/invalid records and are handled during Data Preparation.
In [74]:
df6.loc[(df6['neighborhood']=="Eixample")].describe().T
Out[74]:
count mean std min 25% 50% 75% max
price 4154.0 1186.479779 376.948534 425.000 900.00000 1100.0000 1400.0000 1975.000000
rooms 4154.0 2.403707 0.940306 1.000 2.00000 2.0000 3.0000 4.000000
bathroom 4154.0 1.391911 0.488236 1.000 1.00000 1.0000 2.0000 2.000000
square_meters 4154.0 76.669525 25.506765 16.197 58.00000 74.0505 90.2625 135.660000
square_meters_price 4154.0 16.357695 4.795151 6.074 12.79825 15.0455 19.1155 27.689297
  • The categorical variables are not balanced: 85.5% of properties are "flats", and 78.5% of units are concentrated in 50% of the sample neighbourhoods.
  • 75% of flat units have up to 3 bedrooms and up to 2 bathrooms, with an average size of 85.48 m2.
  • 75% of the units in Eixample have up to 3 bedrooms and up to 2 bathrooms, with an average size of 80.21 m2.

Bivariate Analysis¶

In [75]:
# Create a PairGrid
g = sns.PairGrid(df6, corner=True)

# Map different plots to the grid
g.map_lower(sns.scatterplot)
g.map_diag(sns.histplot,kde=True)

# Show the plot
plt.show()
No description has been provided for this image
In [76]:
# Calculate correlation matrix
corr_matrix = df6.select_dtypes(include=np.number).corr()
In [77]:
# Plot correlation matrix as heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix Heatmap')
plt.show()
No description has been provided for this image
In [78]:
# Display the sorted correlation table
corr_unstacked = corr_matrix.unstack() # Unstack the correlation matrix
corr_unstacked = corr_unstacked.reset_index() # Reset the index to get 'variable1' and 'variable2' as columns
corr_unstacked.columns = ['variable1', 'variable2', 'correlation']# Rename the columns for better understanding
corr_unstacked = corr_unstacked[corr_unstacked['variable1'] != corr_unstacked['variable2']] # Remove self-correlations by filtering out rows where variable1 == variable2
corr_unstacked = corr_unstacked.drop_duplicates(subset=['correlation']) # Drop duplicates to keep only one entry per variable pair
sorted_corr = corr_unstacked.sort_values(by='correlation', key=abs, ascending=False) # Sort the DataFrame by the absolute value of correlation
#sorted_corr # Display the sorted correlation table
In [79]:
# Define a function to categorize the correlation level
def categorize_correlation(correlation):
    abs_corr = abs(correlation) * 100  # Convert to percentage for easier comparison
    if abs_corr < 30:
        return 'Negligible'
    elif 30 <= abs_corr < 50:
        return 'Low'
    elif 50 <= abs_corr < 70:
        return 'Moderate'
    elif 70 <= abs_corr < 90:
        return 'High'
    else:
        return 'Very High'
In [80]:
# Apply the function to create the corr_lvl column
sorted_corr['corr_lvl'] = sorted_corr['correlation'].apply(categorize_correlation)
sorted_corr['corr_lvl'].value_counts()
Out[80]:
corr_lvl
Low           5
Moderate      4
Negligible    1
Name: count, dtype: int64
In [81]:
sorted_corr
Out[81]:
variable1 variable2 correlation corr_lvl
3 price square_meters 0.651214 Moderate
8 rooms square_meters 0.635150 Moderate
13 bathroom square_meters 0.608319 Moderate
2 price bathroom 0.503768 Moderate
7 rooms bathroom 0.451065 Low
9 rooms square_meters_price -0.416305 Low
19 square_meters square_meters_price -0.391874 Low
4 price square_meters_price 0.381253 Low
1 price rooms 0.304009 Low
14 bathroom square_meters_price -0.111716 Negligible
  • There is no pair of variables with high correlation (>75%); the strongest is price vs. square_meters at 0.65.
In [82]:
boxplot_by_group(df6, 'neighborhood', 'price', False, df_name="(prepared data)")
Top 3 neighborhood by price mean value are: Sarria-Sant Gervasi,Eixample,Les Corts
No description has been provided for this image
In [83]:
boxplot_by_group(df6, 'neighborhood', 'square_meters', False, df_name="(prepared data)")
Top 3 neighborhood by square_meters mean value are: Eixample,Sarria-Sant Gervasi,Les Corts
No description has been provided for this image
In [84]:
boxplot_by_group(df6, 'neighborhood', 'square_meters_price', False, df_name="(prepared data)")
Top 3 neighborhood by square_meters_price mean value are: Ciutat Vella,Sarria-Sant Gervasi,Eixample
No description has been provided for this image
In [85]:
boxplot_by_group(df6, 'real_state', 'price', False, df_name="(prepared data)")
Top 3 real_state by price mean value are: apartment,attic,flat
No description has been provided for this image
In [86]:
boxplot_by_group(df6, 'real_state', 'square_meters', False, df_name="(prepared data)")
Top 3 real_state by square_meters mean value are: flat,attic,apartment
No description has been provided for this image
In [87]:
boxplot_by_group(df6, 'real_state', 'square_meters_price', False, df_name="(prepared data)")
Top 3 real_state by square_meters_price mean value are: apartment,study,attic
No description has been provided for this image
  • Top 3 neighborhood by price mean value are: Sarria-Sant Gervasi,Eixample,Les Corts
  • Top 3 neighborhood by square_meters mean value are: Eixample,Sarria-Sant Gervasi,Les Corts
  • Top 3 neighborhood by square_meters_price mean value are: Ciutat Vella,Sarria-Sant Gervasi,Eixample
  • Top 3 real_state by price mean value are: apartment,attic,flat
  • Top 3 real_state by square_meters mean value are: flat,attic,apartment
  • Top 3 real_state by square_meters_price mean value are: apartment,study,attic
  • From a price-per-square-meter perspective, the most attractive unit type in this data could be the flat, with an average surface area of 74.62 m2 (just above the overall average of 73.41 m2) and an average price per square meter of 15.44, below the overall average of 16.14.
In [88]:
plot_crosstab_heat_perc(df6, var_interest='real_state',df_name="prepared data")
No description has been provided for this image
  • There are 3544 flats in Eixample, the most popular unit-type/neighborhood combination: 85.32% of the units in Eixample are flats, and 29.05% of all flats are located in Eixample.
  • Across all neighborhoods, "flat" is the most popular unit type, accounting for at least 85.32% of units in each neighborhood.
In [89]:
plot_crosstab_bar_count(df6, var_interest='lift')
No description has been provided for this image
  • Most unit types have a lift; for flats the proportion is 71%.
In [90]:
plot_crosstab_bar_count(df6, var_interest='terrace')
No description has been provided for this image
  • Units with a terrace, on the other hand, appear to be rare; very few have one.

Exploratory Data Analysis Consolidated Notes¶

Functions

  • univariate_numerical(data): Function to generate two plots for each numerical variable. Histplot for variable distribution. Boxplot for statistical summary
  • univariate_categorical(data): Function to generate countplot for each categorical variable. Labeled with count and percentage
  • plot_crosstab_bar_count(df, var_interest): Function to create a barplot of crosstab of the variable of interest vs each of the rest of categorical variables. Labeled with counts
  • plot_crosstab_heat_perc(df, var_interest): Function to create a heatmap of the crosstab of the variable of interest vs. each of the remaining categorical variables. Labeled with counts, percentage by row, and percentage by column
  • boxplot_by_group(df, group, var, outliers): Boxplot for a numerical variable of interest vs. a categorical variable, with or without outliers; includes the overall mean and the mean by category
  • side_by_side_boxplot(df1, df2, group, var, outliers, title1, title2): to present two side by side boxplot_by_group

Univariate Analysis

  • 'price', 'square_meters' and 'square_meters_price' are right-skewed and reflect the effect of capping outliers at the upper whisker.
  • Comparing the original data (df) with the prepared data (df6), the original numerical variables have float values and many outliers, while in the prepared data the count variables ('rooms', 'bathroom') take integer values and the outliers have been removed.
  • In the prepared data, flat units have at most 4 rooms and 135 m2 of area.
  • In the original data, there are flat units with 10.754 rooms and 679 m2 of area.
  • The "large flat" units in the data are assumed to be unreal/invalid records and are handled during Data Preparation.
  • The categorical variables are not balanced: 85.5% of properties are "flats", and 78.5% of units are concentrated in 50% of the sample neighbourhoods.
  • 75% of flat units have up to 3 bedrooms and up to 2 bathrooms, with an average size of 85.48 m2.
  • 75% of the units in Eixample have up to 3 bedrooms and up to 2 bathrooms, with an average size of 80.21 m2.

Bivariate Analysis

  • There is no pair of variables with high correlation (>75%); the strongest is price vs. square_meters at 0.65.
  • Top 3 neighborhood by price mean value are: Sarria-Sant Gervasi,Eixample,Les Corts
  • Top 3 neighborhood by square_meters mean value are: Eixample,Sarria-Sant Gervasi,Les Corts
  • Top 3 neighborhood by square_meters_price mean value are: Ciutat Vella,Sarria-Sant Gervasi,Eixample
  • Top 3 real_state by price mean value are: apartment,attic,flat
  • Top 3 real_state by square_meters mean value are: flat,attic,apartment
  • Top 3 real_state by square_meters_price mean value are: apartment,study,attic
  • From a price-per-square-meter perspective, the most attractive unit type in this data could be the flat, with an average surface area of 74.62 m2 (just above the overall average of 73.41 m2) and an average price per square meter of 15.44, below the overall average of 16.14.
  • There are 3544 flats in Eixample, the most popular unit-type/neighborhood combination: 85.32% of the units in Eixample are flats, and 29.05% of all flats are located in Eixample.
  • Across all neighborhoods, "flat" is the most popular unit type, accounting for at least 85.32% of units in each neighborhood.
  • Most unit types have a lift; for flats the proportion is 71%.
  • Units with a terrace, on the other hand, appear to be rare; very few have one.

5. Modeling¶

Selecting and applying appropriate machine learning or statistical models. This step includes training, validating, and fine-tuning models to optimize their performance.

Modeling Functions¶

In [91]:
# Define a function to evaluate and return the model's metrics
def evaluate_model(model, x_test, y_test):
    y_pred = model.predict(x_test)
    metrics = {
        "MAE": mean_absolute_error(y_test, y_pred),
        "MSE": mean_squared_error(y_test, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
        "R2 Score": r2_score(y_test, y_pred)
    }
    return metrics
In [92]:
def evaluate_models_with_cv(models, X_train, y_train, X_test, y_test):
    """
    Evaluates multiple regression models using cross-validation and final test set performance.
    
    Parameters:
    models: list of tuples (model_name, model_instance)
    X_train, y_train: training data
    X_test, y_test: test data

    Returns:
    - results_df: DataFrame containing CV and test metrics for each model
    - trained_models: Dictionary of trained models for future use
    """

    results_list = []  # List to store model results
    trained_models = {}  # Dictionary to store trained models
    
    # Define 5-fold cross-validation
    kfold = KFold(n_splits=5, shuffle=True, random_state=1)

    for name, model in models:
        # Perform cross-validation on training set
        cv_results = cross_validate(
            model, X_train, y_train, 
            scoring=["neg_mean_absolute_error", "neg_mean_squared_error", "r2"],
            cv=kfold, return_train_score=False
        )

        # Extract mean values and convert negatives to positives
        train_mae = -cv_results["test_neg_mean_absolute_error"].mean()
        train_mse = -cv_results["test_neg_mean_squared_error"].mean()
        train_rmse = np.sqrt(train_mse)
        train_r2 = cv_results["test_r2"].mean()

        # Append CV results to list
        results_list.append({
            "Model": f"{name}_CV",
            "MAE": train_mae,
            "MSE": train_mse,
            "RMSE": train_rmse,
            "R2 Score": train_r2
        })

        # Train model on full training data and evaluate on test set
        model.fit(X_train, y_train)
        trained_models[name] = model  # Store trained model

        y_pred = model.predict(X_test)
        test_mae = mean_absolute_error(y_test, y_pred)
        test_mse = mean_squared_error(y_test, y_pred)
        test_rmse = np.sqrt(test_mse)
        test_r2 = r2_score(y_test, y_pred)

        # Append test set results to list
        results_list.append({
            "Model": f"{name}_Test",
            "MAE": test_mae,
            "MSE": test_mse,
            "RMSE": test_rmse,
            "R2 Score": test_r2
        })

    # Convert results list to DataFrame
    results_cv = pd.DataFrame(results_list)
    
    return results_cv, trained_models
In [93]:
def univariate_numerical_y(y):
    """
    Function to generate two plots for the numerical variable y:
    - Histogram for variable distribution
    - Boxplot for statistical summary
    """
    # Create a figure with two subplots
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Histogram
    axes[0].hist(y, bins=30, color='blue', alpha=0.7)
    axes[0].set_title('Histogram of y')
    axes[0].set_xlabel('Value')
    axes[0].set_ylabel('Frequency')

    # Boxplot
    axes[1].boxplot(y, vert=False)
    axes[1].set_title('Boxplot of y')
    axes[1].set_xlabel('Value')

    plt.tight_layout()
    plt.show()
  • Defined function "evaluate_model(model, x_test, y_test)" to evaluate a trained model and return its metrics (MAE, MSE, RMSE, R²), later collected into a results dataframe
  • Defined function "evaluate_models_with_cv(models, X_train, y_train, X_test, y_test)" to evaluate multiple regression models using cross-validation and final test-set performance
  • Defined function "univariate_numerical_y(y)" to generate two plots (Histogram and Boxplot) for the numerical variable y
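One detail inside `evaluate_models_with_cv` worth noting: sklearn's `neg_mean_absolute_error` and `neg_mean_squared_error` scorers return negated errors by convention, hence the sign flips in the function. A standalone illustration of the same pattern on synthetic data (not the project data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression problem with a known linear signal
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
cv = cross_validate(
    LinearRegression(), X, y,
    scoring=["neg_mean_absolute_error", "neg_mean_squared_error", "r2"],
    cv=kfold,
)

# The neg_* scorers are negated errors, so flip the sign to report positive errors
mae = -cv["test_neg_mean_absolute_error"].mean()
rmse = np.sqrt(-cv["test_neg_mean_squared_error"].mean())
r2 = cv["test_r2"].mean()
print(f"CV MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

This convention exists so that "higher is better" holds for every scorer, which is what `GridSearchCV` and friends assume.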

Preparing data for modeling¶

In [94]:
data=df6.copy()
  • Modeling will be done on a copy (data) of the prepared data (df6)
In [95]:
# 1. Specify independent (X) and dependent (y) variables
X = data.drop(["price"], axis=1)
y = data["price"]

# 2. Create dummy variables for categorical features
X = pd.get_dummies(X, columns=['real_state', 'neighborhood'], drop_first=True)  # drop_first=True to avoid multicollinearity

# 3. Convert boolean columns to numeric (0 and 1)
bool_cols = X.select_dtypes(['bool'])
for col in bool_cols.columns:
    X[col] = X[col].astype('int')

# 4. Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# 5. Transform and scale right-skewed variables (applied **only to training data to avoid data leakage**)
pt = PowerTransformer(method='yeo-johnson')  # Works with zero/negative values

# Fit only on training data, then transform both training and test data
X_train[['square_meters', 'square_meters_price']] = pt.fit_transform(X_train[['square_meters', 'square_meters_price']])
X_test[['square_meters', 'square_meters_price']] = pt.transform(X_test[['square_meters', 'square_meters_price']])  # Transform only

# 6. Standardize the transformed numerical features (again, to prevent data leakage)
scaler = StandardScaler()
X_train[['square_meters', 'square_meters_price']] = scaler.fit_transform(X_train[['square_meters', 'square_meters_price']])
X_test[['square_meters', 'square_meters_price']] = scaler.transform(X_test[['square_meters', 'square_meters_price']])  # Use the same scaler

# 7. Add a constant to independent variables (after scaling, only for models that need it)
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)
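The fit-on-train / transform-on-test discipline in steps 5-6 can also be packaged in a scikit-learn `Pipeline` with a `ColumnTransformer`, which enforces it automatically. A sketch under the same column names, on hypothetical minimal data (not the project data):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Hypothetical frame reusing the project's column names
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "square_meters": rng.lognormal(4.0, 0.4, 300),        # right-skewed
    "square_meters_price": rng.lognormal(2.7, 0.3, 300),  # right-skewed
    "rooms": rng.integers(1, 5, 300),
})
y = X["square_meters"] * X["square_meters_price"]

skewed = ["square_meters", "square_meters_price"]
pre = ColumnTransformer(
    transformers=[("skewed", Pipeline([
        ("yeo", PowerTransformer(method="yeo-johnson")),
        ("std", StandardScaler()),
    ]), skewed)],
    remainder="passthrough",  # other columns pass through untouched
)
model = Pipeline([("pre", pre), ("reg", LinearRegression())])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)
model.fit(X_tr, y_tr)    # transformers are fitted on the training split only
print(round(model.score(X_te, y_te), 3))
```

Wrapping the transforms this way also means cross-validation refits them per fold, which the manual approach above cannot do without extra code.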
In [96]:
# Checking training and test sets.
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
Shape of Training set :  (9988, 19)
Shape of test set :  (4281, 19)
In [97]:
X_train.head()
Out[97]:
const rooms bathroom lift terrace square_meters square_meters_price real_state_attic real_state_flat real_state_study neighborhood_Eixample neighborhood_Les Corts neighborhood_Sant Martí neighborhood_Ciutat Vella neighborhood_Gràcia neighborhood_Sants-Montjuïc neighborhood_Sant Andreu neighborhood_Horta- Guinardo neighborhood_Nou Barris
16047 1.0 2 1 1 0 0.163420 0.503709 0 1 0 0 0 0 0 0 0 0 0 0
10334 1.0 1 1 1 0 -1.500990 1.774314 0 1 0 1 0 0 0 0 0 0 0 0
10144 1.0 2 1 0 0 -1.953426 0.850089 0 1 0 0 0 0 0 0 0 0 0 0
8401 1.0 3 1 1 1 1.666115 -0.559885 0 1 0 0 0 0 0 0 0 0 0 0
2041 1.0 3 1 0 1 -0.098884 -1.895277 0 1 0 0 0 0 0 0 0 0 1 0
In [98]:
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9988 entries, 16047 to 15324
Data columns (total 19 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   const                         9988 non-null   float64
 1   rooms                         9988 non-null   int64  
 2   bathroom                      9988 non-null   int64  
 3   lift                          9988 non-null   int64  
 4   terrace                       9988 non-null   int64  
 5   square_meters                 9988 non-null   float64
 6   square_meters_price           9988 non-null   float64
 7   real_state_attic              9988 non-null   int64  
 8   real_state_flat               9988 non-null   int64  
 9   real_state_study              9988 non-null   int64  
 10  neighborhood_Eixample         9988 non-null   int64  
 11  neighborhood_Les Corts        9988 non-null   int64  
 12  neighborhood_Sant Martí       9988 non-null   int64  
 13  neighborhood_Ciutat Vella     9988 non-null   int64  
 14  neighborhood_Gràcia           9988 non-null   int64  
 15  neighborhood_Sants-Montjuïc   9988 non-null   int64  
 16  neighborhood_Sant Andreu      9988 non-null   int64  
 17  neighborhood_Horta- Guinardo  9988 non-null   int64  
 18  neighborhood_Nou Barris       9988 non-null   int64  
dtypes: float64(3), int64(16)
memory usage: 1.5 MB
  • The dataset contains numerical features with different scales, which may affect algorithms sensitive to scale.
  • Several models will be tried, including distance-based models (SVM, KNN) that perform better with standardized data, and linear models (Linear Regression, Ridge, Lasso) that can converge faster with standardized inputs.
  • Due to the different scales and the models to be evaluated, the data will be standardized:
    • 'price' is the target variable; standardizing the target (y) is not necessary for most regression models.
    • 'rooms' and 'bathroom' show discrete distributions with peaks at integer values; no scaling is applied.
    • Categorical or binary variables such as 'lift', 'terrace', 'real_state' and 'neighborhood' do not need scaling.
    • 'square_meters' and 'square_meters_price' have right-skewed distributions and will be transformed using PowerTransformer (Yeo-Johnson) before applying StandardScaler.
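The effect of the Yeo-Johnson transform on a right-skewed variable can be verified directly; a small sketch on synthetic skewed data (not the project columns):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Lognormal data is strongly right-skewed, like the raw area/price columns
rng = np.random.default_rng(1)
x = rng.lognormal(mean=4.0, sigma=0.5, size=2000).reshape(-1, 1)

pt = PowerTransformer(method="yeo-johnson")  # also standardizes by default
x_t = pt.fit_transform(x)

print(f"skew before: {skew(x.ravel()):.2f}, after: {skew(x_t.ravel()):.2f}")
```

The transform pulls the long right tail in, leaving a roughly symmetric distribution, which is the property the linear models benefit from.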
In [99]:
univariate_numerical(X_train)
No description has been provided for this image
In [100]:
univariate_numerical_y(y_train)
No description has been provided for this image

Modeling Consolidated Notes¶

  • Defined function "evaluate_model(model, x_test, y_test)" to evaluate a trained model and return its metrics (MAE, MSE, RMSE, R²), later collected into a results dataframe
  • Defined function "evaluate_models_with_cv(models, X_train, y_train, X_test, y_test)" to evaluate multiple regression models using cross-validation and final test-set performance
  • Defined function "univariate_numerical_y(y)" to generate two plots (Histogram and Boxplot) for the numerical variable y
  • Modeling will be done on a copy (data) of the prepared data (df6)
  • The dataset contains numerical features with different scales, which may affect algorithms sensitive to scale.
  • Several models will be tried, including distance-based models (SVM, KNN) that perform better with standardized data, and linear models (Linear Regression, Ridge, Lasso) that can converge faster with standardized inputs.
  • Due to the different scales and the models to be evaluated, the data will be standardized:
    • 'price' is the target variable; standardizing the target (y) is not necessary for most regression models.
    • 'rooms' and 'bathroom' show discrete distributions with peaks at integer values; no scaling is applied.
    • Categorical or binary variables such as 'lift', 'terrace', 'real_state' and 'neighborhood' do not need scaling.
    • 'square_meters' and 'square_meters_price' have right-skewed distributions and will be transformed using PowerTransformer (Yeo-Johnson) before applying StandardScaler.

6. Evaluation¶

Assessing the model's performance using metrics such as accuracy, precision, recall, or others relevant to the project. Ensuring the model meets the required standards for deployment.

Regression Models¶

In [102]:
# Define a dictionary of regression models
regression_models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(),
    "Ridge Regression": Ridge(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "K-Nearest Neighbors": KNeighborsRegressor(),
    "Support Vector Regressor": SVR()
}
  • Models to be tested are : Linear Regression, Lasso Regression, Ridge Regression, Decision Tree, Random Forest, K-Nearest Neighbors, and Support Vector Regressor
In [103]:
# Initialize an empty DataFrame to store results
results_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
  • Performance Metrics:
    • MAE (Mean Absolute Error): Measures the average magnitude of errors in a set of predictions, without considering their direction.
    • MSE (Mean Squared Error): Measures the average of the squares of the errors, giving more weight to larger errors.
    • RMSE (Root Mean Squared Error): The square root of MSE, providing error in the same units as the target variable.
    • R2 Score (Coefficient of Determination): Indicates how well the model's predictions approximate the real data points. A value closer to 1 indicates a better fit.
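As a sanity check, all four metrics can be computed from their definitions and compared against the sklearn helpers (toy numbers, not project data):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

err = y_true - y_pred
mae = np.mean(np.abs(err))                                        # mean |error| ≈ 16.67
mse = np.mean(err ** 2)                                           # mean squared error ≈ 366.67
rmse = np.sqrt(mse)                                               # same units as price
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2) # 1 - SS_res/SS_tot = 0.945

assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```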
In [104]:
%%time
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)
CPU times: total: 17.3 s
Wall time: 19.4 s
In [105]:
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
Out[105]:
Model MAE MSE RMSE R2 Score
4 Random Forest 39.804324 4187.071421 64.707584 0.970026
3 Decision Tree 50.510488 7150.292357 84.559401 0.948814
2 Ridge Regression 67.054647 9179.684362 95.810669 0.934286
0 Linear Regression 67.057076 9180.541068 95.815140 0.934280
1 Lasso Regression 67.493485 9249.945144 96.176635 0.933783
5 K-Nearest Neighbors 74.951133 11475.952189 107.125871 0.917848
6 Support Vector Regressor 98.720296 23091.488357 151.958838 0.834697
In [106]:
results_df.sort_values(by="MAE")
Out[106]:
Model MAE MSE RMSE R2 Score
4 Random Forest 39.804324 4187.071421 64.707584 0.970026
3 Decision Tree 50.510488 7150.292357 84.559401 0.948814
2 Ridge Regression 67.054647 9179.684362 95.810669 0.934286
0 Linear Regression 67.057076 9180.541068 95.815140 0.934280
1 Lasso Regression 67.493485 9249.945144 96.176635 0.933783
5 K-Nearest Neighbors 74.951133 11475.952189 107.125871 0.917848
6 Support Vector Regressor 98.720296 23091.488357 151.958838 0.834697
In [107]:
results_df.sort_values(by="MSE")
Out[107]:
Model MAE MSE RMSE R2 Score
4 Random Forest 39.804324 4187.071421 64.707584 0.970026
3 Decision Tree 50.510488 7150.292357 84.559401 0.948814
2 Ridge Regression 67.054647 9179.684362 95.810669 0.934286
0 Linear Regression 67.057076 9180.541068 95.815140 0.934280
1 Lasso Regression 67.493485 9249.945144 96.176635 0.933783
5 K-Nearest Neighbors 74.951133 11475.952189 107.125871 0.917848
6 Support Vector Regressor 98.720296 23091.488357 151.958838 0.834697
In [108]:
results_df.sort_values(by="RMSE")
Out[108]:
Model MAE MSE RMSE R2 Score
4 Random Forest 39.804324 4187.071421 64.707584 0.970026
3 Decision Tree 50.510488 7150.292357 84.559401 0.948814
2 Ridge Regression 67.054647 9179.684362 95.810669 0.934286
0 Linear Regression 67.057076 9180.541068 95.815140 0.934280
1 Lasso Regression 67.493485 9249.945144 96.176635 0.933783
5 K-Nearest Neighbors 74.951133 11475.952189 107.125871 0.917848
6 Support Vector Regressor 98.720296 23091.488357 151.958838 0.834697
  • Random Forest metrics: Lowest MAE, lowest RMSE, and highest R².
  • Random Forest is the best performer overall, indicating strong predictive accuracy and low error.
  • Decision Tree metrics: Moderate errors with a good R².
  • Decision Tree is a strong candidate, although slightly behind Random Forest.
  • Ridge, Linear, and Lasso Regression metrics are consistent with each other, but their performance is noticeably lower than the tree-based methods. They might not be ideal for further tuning if the goal is the best predictive performance.
  • For hyperparameter tuning and further validation, Random Forest and Decision Tree stand out as the best candidates due to their superior performance metrics.
  • While the linear models (Ridge, Linear, and Lasso) can serve as strong baselines, they do not match the predictive accuracy of the tree-based models.
  • K-Nearest Neighbors and SVR appear less promising for further development on this dataset.
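A hyperparameter search for the tree-based candidates could be sketched with `GridSearchCV`; the grid values below are illustrative placeholders (not the ones used in this project), shown on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the project would pass X_train, y_train instead
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

param_grid = {                 # illustrative grid, not tuned for this dataset
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid,
    scoring="neg_mean_absolute_error",  # higher (less negative) is better
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, -search.best_score_)  # flip sign to report MAE
```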

Feature Engineering¶

In [109]:
# Define the model with the selected hyperparameters
RandomForest = RandomForestRegressor()

# Train the model on the entire training dataset
RandomForest.fit(X_train, y_train)

# Feature importance
feature_importances = pd.Series(RandomForest.feature_importances_, index=X_train.columns)
feature_importances = feature_importances.sort_values(ascending=False)

# Plotting
plt.figure(figsize=(10, 6))
feature_importances.plot(kind='bar')
plt.title('Feature Importance')
plt.xlabel('Features')
plt.ylabel('Importance Score')
plt.show()
No description has been provided for this image
  • From the feature importance plot, square_meters is the most significant variable, followed by square_meters_price.
  • Since price is directly derived from square_meters * square_meters_price, including both may not add new information and could introduce redundancy.
  • It makes no sense to require the end user to supply square_meters and square_meters_price in order to "predict" price.
  • NEW MODELS will be evaluated with the feature square_meters_price DROPPED from the data
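The redundancy can be made concrete: since price = square_meters × square_meters_price by construction, log(price) = log(square_meters) + log(square_meters_price), so the pair fully determines the target and leaves no residual information for other features. A synthetic sketch (not the project data):

```python
import numpy as np

rng = np.random.default_rng(1)
sm = rng.lognormal(4.0, 0.4, 500)    # stand-in for square_meters
smp = rng.lognormal(2.7, 0.3, 500)   # stand-in for square_meters_price
price = sm * smp                     # price is fully determined by the pair

# Least squares in log space recovers the exact identity: coefficients ≈ [1, 1]
A = np.column_stack([np.log(sm), np.log(smp)])
coef, *_ = np.linalg.lstsq(A, np.log(price), rcond=None)
print(coef)
```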
In [110]:
# Drop the constant column
X_train_vif = X_train.drop(columns=['const'])
In [111]:
vif_series = pd.Series(
    [variance_inflation_factor(X_train_vif.values, i) for i in range(X_train_vif.shape[1])],
    index=X_train_vif.columns,
    dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

rooms                           11.080074
bathroom                        11.296229
lift                             3.449079
terrace                          1.346301
square_meters                    2.099052
square_meters_price              1.568111
real_state_attic                 1.352480
real_state_flat                  9.137370
real_state_study                 1.165991
neighborhood_Eixample            2.586087
neighborhood_Les Corts           1.328271
neighborhood_Sant Martí          1.425954
neighborhood_Ciutat Vella        1.887352
neighborhood_Gràcia              1.483803
neighborhood_Sants-Montjuïc      1.441272
neighborhood_Sant Andreu         1.144297
neighborhood_Horta- Guinardo     1.242755
neighborhood_Nou Barris          1.077936
dtype: float64

  • Although its VIF (1.568) is low (suggesting no strong collinearity within the dataset), the mathematical dependence between square_meters and square_meters_price suggests redundancy.
  • This means the model could overestimate the importance of one feature over another and lead to unstable coefficient estimates.
  • By keeping only square_meters, the model remains more interpretable, focusing on how space affects price rather than a derived variable.
  • Features 'rooms' and 'bathroom' show high multicollinearity (VIF > 10) and will also be dropped from modeling
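For reference, each VIF above follows the definition VIF_i = 1 / (1 − R²_i), where R²_i comes from regressing feature i on all the remaining features. A numpy-only sketch verifying the definition on synthetic correlated data (not the project data):

```python
import numpy as np

def vif(X):
    """VIF_i = 1 / (1 - R^2_i), with R^2_i from OLS of column i on the others."""
    n, p = X.shape
    out = []
    for i in range(p):
        # Regress column i on the remaining columns plus an intercept
        others = np.column_stack([np.delete(X, i, axis=1), np.ones(n)])
        beta, *_ = np.linalg.lstsq(others, X[:, i], rcond=None)
        resid = X[:, i] - others @ beta
        r2 = 1 - resid.var() / X[:, i].var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.3, size=500)   # strongly collinear with a
c = rng.normal(size=500)                  # independent
print(vif(np.column_stack([a, b, c])))    # high for a and b, near 1 for c
```

This matches what `statsmodels.stats.outliers_influence.variance_inflation_factor` computes column by column.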
In [112]:
def preprocess_data(data, target_feature, drop_features, scale_features, test_size=0.30, random_state=1):
    """
    Preprocesses the dataset by handling categorical variables, boolean conversion,
    splitting data, transforming skewed features, standardizing, and adding a constant.
    
    Parameters:
    - data: DataFrame containing the full dataset.
    - target_feature: Name of the dependent variable.
    - drop_features: List of features to drop from the dataset.
    - scale_features: List of numerical features to transform and scale.
    - test_size: Proportion of the dataset to include in the test split.
    - random_state: Seed for reproducibility.
    
    Returns:
    - X_train, X_test, y_train, y_test: Processed training and test datasets.
    """
    # 1. Specify independent (X) and dependent (y) variables
    X = data.drop(drop_features, axis=1)
    y = data[target_feature]
    
    # 2. Create dummy variables for categorical features
    categorical_features = ['real_state', 'neighborhood']
    X = pd.get_dummies(X, columns=categorical_features, drop_first=True)
    
    # 3. Convert boolean columns to numeric (0 and 1)
    bool_cols = X.select_dtypes(['bool']).columns
    X[bool_cols] = X[bool_cols].astype(int)
    
    # 4. Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    
    # 5. Transform and scale right-skewed variables (PowerTransformer for skewed data)
    pt = PowerTransformer(method='yeo-johnson')
    X_train[scale_features] = pt.fit_transform(X_train[scale_features])
    X_test[scale_features] = pt.transform(X_test[scale_features])
    
    # 6. Standardize the transformed numerical features
    scaler = StandardScaler()
    X_train[scale_features] = scaler.fit_transform(X_train[scale_features])
    X_test[scale_features] = scaler.transform(X_test[scale_features])
    
    # 7. Add a constant to independent variables (after scaling)
    X_train = sm.add_constant(X_train)
    X_test = sm.add_constant(X_test)
    
    return X_train, X_test, y_train, y_test
  • Defined function "preprocess_data(data, target_feature, drop_features, scale_features, test_size=0.30, random_state=1)", to iterate on the data preparation for modeling
In [113]:
X_train, X_test, y_train, y_test = preprocess_data(data, 'price', ['price', 'square_meters_price'], ['square_meters'], test_size=0.30, random_state=1)
  • Data preparation dropping the feature square_meters_price
In [121]:
# Initialize an empty DataFrame to store results
results_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
In [122]:
%%time
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)
CPU times: total: 14.5 s
Wall time: 16.4 s
In [123]:
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
Out[123]:
Model MAE MSE RMSE R2 Score
0 Linear Regression 197.545690 65051.360633 255.051682 0.534324
2 Ridge Regression 197.554341 65051.439986 255.051838 0.534323
1 Lasso Regression 198.916759 65589.695703 256.104853 0.530470
5 K-Nearest Neighbors 198.173137 69939.496099 264.460765 0.499331
4 Random Forest 198.947624 70835.540589 266.149470 0.492917
6 Support Vector Regressor 211.486036 86079.067618 293.392344 0.383795
3 Decision Tree 226.355418 97971.676701 313.004276 0.298660
  • Linear Regression and Ridge Regression performed the best in terms of R² Score
  • Feature selection will be performed to reduce multicollinearity.
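The VIF for feature j is 1/(1 − R²_j), where R²_j comes from regressing feature j on all the other features. The statsmodels variance_inflation_factor call used below can be cross-checked with a plain least-squares sketch (toy data; the helper name is illustrative):

```python
import numpy as np

def vif_manual(X):
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the rest."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add an intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return vifs
```

A feature that is nearly a linear combination of the others gets a large VIF; an independent feature stays close to 1, which is the reading applied to the series below.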
In [124]:
# Drop the constant column
X_train_vif = X_train.drop(columns=['const'])

vif_series = pd.Series(
    [variance_inflation_factor(X_train_vif.values, i) for i in range(X_train_vif.shape[1])],
    index=X_train_vif.columns,
    dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

bathroom                        8.222008
lift                            3.418865
terrace                         1.338489
square_meters                   1.354581
real_state_attic                1.323041
real_state_flat                 7.304896
real_state_study                1.144434
neighborhood_Eixample           2.459688
neighborhood_Les Corts          1.300169
neighborhood_Sant Martí         1.382281
neighborhood_Ciutat Vella       1.849081
neighborhood_Gràcia             1.440009
neighborhood_Sants-Montjuïc     1.393866
neighborhood_Sant Andreu        1.121367
neighborhood_Horta- Guinardo    1.201621
neighborhood_Nou Barris         1.063670
dtype: float64

In [125]:
X_train, X_test, y_train, y_test = preprocess_data(data, ['price'], ['price','square_meters_price','rooms'], ['square_meters'], test_size=0.30, random_state=1)
  • Data preparation dropping the feature 'rooms' due to high multicollinearity
In [126]:
# Initialize an empty DataFrame to store results
results_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
In [127]:
%%time
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)
CPU times: total: 15.1 s
Wall time: 16.6 s
In [128]:
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
Out[128]:
Model MAE MSE RMSE R2 Score
0 Linear Regression 197.545690 65051.360633 255.051682 0.534324
2 Ridge Regression 197.554341 65051.439986 255.051838 0.534323
1 Lasso Regression 198.916759 65589.695703 256.104853 0.530470
5 K-Nearest Neighbors 198.173137 69939.496099 264.460765 0.499331
4 Random Forest 199.230373 71217.270390 266.865641 0.490184
6 Support Vector Regressor 211.486036 86079.067618 293.392344 0.383795
3 Decision Tree 227.111963 98476.792928 313.810122 0.295044
In [129]:
# Drop the constant column
X_train_vif = X_train.drop(columns=['const'])

vif_series = pd.Series(
    [variance_inflation_factor(X_train_vif.values, i) for i in range(X_train_vif.shape[1])],
    index=X_train_vif.columns,
    dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

bathroom                        8.222008
lift                            3.418865
terrace                         1.338489
square_meters                   1.354581
real_state_attic                1.323041
real_state_flat                 7.304896
real_state_study                1.144434
neighborhood_Eixample           2.459688
neighborhood_Les Corts          1.300169
neighborhood_Sant Martí         1.382281
neighborhood_Ciutat Vella       1.849081
neighborhood_Gràcia             1.440009
neighborhood_Sants-Montjuïc     1.393866
neighborhood_Sant Andreu        1.121367
neighborhood_Horta- Guinardo    1.201621
neighborhood_Nou Barris         1.063670
dtype: float64

  • After removing the feature 'rooms', Linear Regression and Ridge Regression still perform best in terms of R² Score, but some features with high multicollinearity remain
In [130]:
X_train, X_test, y_train, y_test = preprocess_data(data, ['price'], ['price','square_meters_price','rooms','bathroom'], ['square_meters'], test_size=0.30, random_state=1)
  • Data preparation dropping the feature 'bathroom' due to high multicollinearity
In [131]:
# Initialize an empty DataFrame to store results
results_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
In [132]:
%%time
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)
CPU times: total: 15.2 s
Wall time: 17.2 s
In [135]:
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
Out[135]:
Model MAE MSE RMSE R2 Score
2 Ridge Regression 203.954404 68258.459986 261.263201 0.511365
0 Linear Regression 203.946460 68259.489536 261.265171 0.511358
1 Lasso Regression 204.980574 68760.982886 262.223155 0.507768
5 K-Nearest Neighbors 205.304975 74808.167353 273.510818 0.464479
4 Random Forest 203.978473 74835.714760 273.561172 0.464281
6 Support Vector Regressor 216.002731 89473.683618 299.121520 0.359494
3 Decision Tree 233.919806 105994.254542 325.567588 0.241230
In [136]:
# Drop the constant column
X_train_vif = X_train.drop(columns=['const'])

vif_series = pd.Series(
    [variance_inflation_factor(X_train_vif.values, i) for i in range(X_train_vif.shape[1])],
    index=X_train_vif.columns,
    dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

lift                            3.240716
terrace                         1.335683
square_meters                   1.100341
real_state_attic                1.259749
real_state_flat                 5.435033
real_state_study                1.099832
neighborhood_Eixample           2.160766
neighborhood_Les Corts          1.248794
neighborhood_Sant Martí         1.317437
neighborhood_Ciutat Vella       1.622886
neighborhood_Gràcia             1.365185
neighborhood_Sants-Montjuïc     1.317733
neighborhood_Sant Andreu        1.107205
neighborhood_Horta- Guinardo    1.170314
neighborhood_Nou Barris         1.056992
dtype: float64

  • The feature real_state_flat remains with VIF > 5
  • Since "flat" is the most frequent category across neighborhoods, it may be highly correlated with certain neighborhood variables.
  • Instead of removing real_state_flat, it will be used as the baseline category for real_state
In [137]:
def preprocess_data(data, target_feature, drop_features, scale_features, categorical_features, baseline_categories, test_size=0.30, random_state=1):
    """
    Preprocesses the dataset by handling categorical variables, boolean conversion,
    splitting data, transforming skewed features, standardizing, and adding a constant.
    
    Parameters:
    - data: DataFrame containing the full dataset.
    - target_feature: Name of the dependent variable.
    - drop_features: List of features to drop.
    - scale_features: List of numerical features to transform and scale.
    - categorical_features: List of categorical features to encode.
    - baseline_categories: Dictionary specifying baseline category for each categorical variable.
    - test_size: Proportion of the dataset to include in the test split.
    - random_state: Seed for reproducibility.
    
    Returns:
    - X_train, X_test, y_train, y_test: Processed training and test datasets.
    """
    # 1. Specify independent (X) and dependent (y) variables
    X = data.drop([target_feature] + drop_features, axis=1)
    y = data[target_feature]
    
    # 2. Create dummy variables for categorical features with specified baseline categories
    X = pd.get_dummies(X, columns=categorical_features, drop_first=False)
    for feature, baseline in baseline_categories.items():
        if f"{feature}_{baseline}" in X.columns:
            X.drop(columns=[f"{feature}_{baseline}"], inplace=True)
    
    # 3. Convert boolean columns to numeric (0 and 1)
    bool_cols = X.select_dtypes(['bool']).columns
    X[bool_cols] = X[bool_cols].astype(int)
    
    # 4. Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    
    # 5. Transform and scale right-skewed variables (PowerTransformer for skewed data)
    pt = PowerTransformer(method='yeo-johnson')
    X_train[scale_features] = pt.fit_transform(X_train[scale_features])
    X_test[scale_features] = pt.transform(X_test[scale_features])
    
    # 6. Standardize the transformed numerical features
    scaler = StandardScaler()
    X_train[scale_features] = scaler.fit_transform(X_train[scale_features])
    X_test[scale_features] = scaler.transform(X_test[scale_features])
    
    # 7. Add a constant to independent variables (after scaling)
    X_train = sm.add_constant(X_train)
    X_test = sm.add_constant(X_test)
    
    return X_train, X_test, y_train, y_test
  • Modified preprocess_data function to control one-hot encoding category to drop
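The motivation is that pd.get_dummies(..., drop_first=True) always drops the alphabetically first category, not necessarily the intended baseline. Choosing the baseline explicitly works as in this toy example (illustrative values):

```python
import pandas as pd

df = pd.DataFrame({"real_state": ["flat", "attic", "flat", "study", "flat"]})

# drop_first=True silently drops the alphabetically first category ('attic')
auto = pd.get_dummies(df, columns=["real_state"], drop_first=True)

# Explicit baseline: encode all categories, then drop the chosen one ('flat')
manual = pd.get_dummies(df, columns=["real_state"], drop_first=False)
manual = manual.drop(columns=["real_state_flat"])
```

This is the same pattern the modified function applies via its baseline_categories dictionary.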
In [138]:
plot_crosstab_heat_perc(df6, var_interest='real_state',df_name="prepared data")
[Figure: percentage crosstab heatmap of real_state by neighborhood in the prepared data]
  • Selected 'flat' (real_state) and 'Eixample' (neighborhood) as the baseline categories for one-hot encoding
In [139]:
X_train, X_test, y_train, y_test = preprocess_data(
    data=data, 
    target_feature="price", 
    drop_features=["price", "square_meters_price", "rooms", "bathroom"],
    scale_features=["square_meters"], 
    categorical_features=["real_state", "neighborhood"], 
    baseline_categories={"real_state": "flat", "neighborhood": "Eixample"}, 
    test_size=0.30, 
    random_state=1
)
In [140]:
# Initialize an empty DataFrame to store results
results_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
In [141]:
%%time
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)
CPU times: total: 20.5 s
Wall time: 29.4 s
In [142]:
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
Out[142]:
Model MAE MSE RMSE R2 Score
0 Linear Regression 203.946460 68259.489536 261.265171 0.511358
2 Ridge Regression 203.959557 68260.799293 261.267677 0.511349
1 Lasso Regression 205.169206 68709.866263 262.125669 0.508134
4 Random Forest 203.172925 74554.912118 273.047454 0.466292
5 K-Nearest Neighbors 204.794534 74634.435609 273.193037 0.465722
6 Support Vector Regressor 216.052970 89802.139881 299.670052 0.357143
3 Decision Tree 234.987992 106660.526170 326.589232 0.236460
In [143]:
# Drop the constant column
X_train_vif = X_train.drop(columns=['const'])

vif_series = pd.Series(
    [variance_inflation_factor(X_train_vif.values, i) for i in range(X_train_vif.shape[1])],
    index=X_train_vif.columns,
    dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

lift                                2.020784
terrace                             1.325467
square_meters                       1.103906
real_state_apartment                1.082000
real_state_attic                    1.093367
real_state_study                    1.049390
neighborhood_Sarria-Sant Gervasi    1.255990
neighborhood_Les Corts              1.098590
neighborhood_Sant Martí             1.121619
neighborhood_Ciutat Vella           1.207858
neighborhood_Gràcia                 1.118434
neighborhood_Sants-Montjuïc         1.105899
neighborhood_Sant Andreu            1.037034
neighborhood_Horta- Guinardo        1.053019
neighborhood_Nou Barris             1.018164
dtype: float64

  • No multicollinearity remains in the data, suggesting that the number of rooms and bathrooms is less relevant than a property's area, type, and neighborhood
  • Linear Regression and Ridge Regression are the best models among those tested, but the R² scores (~0.51) indicate the models explain only about half of the variance in the target variable.
  • More advanced models will be included in the evaluation

Advanced Regression Models¶

  • Models to be tested are: DecisionTree_Tuned_1, RandomForest_Tuned_1, GradientBoosting_Tuned_1, XGBoost_Tuned_1, LightGBM_Tuned_1, NeuralNetwork(MLP)
In [147]:
# Define a dictionary of regression models
regression_models_2 = {
    "DecisionTree_Tuned_1": DecisionTreeRegressor(max_depth=10, min_samples_split=5),
    "RandomForest_Tuned_1": RandomForestRegressor(max_depth=10, min_samples_split=5, n_estimators=200),
    "GradientBoosting_Tuned_1": GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=5),
    "XGBoost_Tuned_1": xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=5),
    "LightGBM_Tuned_1": lgb.LGBMRegressor(n_estimators=200, learning_rate=0.1, max_depth=5, verbose=-1),
    #"CatBoost": catb.CatBoostRegressor(iterations=200, learning_rate=0.1, depth=5, verbose=0),
    "NeuralNetwork(MLP)": MLPRegressor(hidden_layer_sizes=(100,), activation='relu', solver='adam', max_iter=500)
}
In [148]:
# Initialize an empty DataFrame to store results
results_df_2 = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
In [149]:
%%time
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models_2.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df_2 = pd.concat([results_df_2, pd.DataFrame([metrics])], ignore_index=True)
CPU times: total: 22.8 s
Wall time: 23.8 s
In [150]:
# Display the results DataFrame
results_df_2.sort_values(by="R2 Score", ascending=False)
Out[150]:
Model MAE MSE RMSE R2 Score
2 GradientBoosting_Tuned_1 191.041670 63583.720750 252.158126 0.544830
4 LightGBM_Tuned_1 191.308692 63653.691737 252.296833 0.544329
3 XGBoost_Tuned_1 191.172298 63903.217757 252.790858 0.542543
1 RandomForest_Tuned_1 192.136590 64281.579837 253.538123 0.539834
5 NeuralNetwork(MLP) 201.391227 66855.660295 258.564615 0.521407
0 DecisionTree_Tuned_1 197.919800 70045.064728 264.660282 0.498576
In [151]:
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
Out[151]:
Model MAE MSE RMSE R2 Score
0 Linear Regression 203.946460 68259.489536 261.265171 0.511358
2 Ridge Regression 203.959557 68260.799293 261.267677 0.511349
1 Lasso Regression 205.169206 68709.866263 262.125669 0.508134
4 Random Forest 203.172925 74554.912118 273.047454 0.466292
5 K-Nearest Neighbors 204.794534 74634.435609 273.193037 0.465722
6 Support Vector Regressor 216.052970 89802.139881 299.670052 0.357143
3 Decision Tree 234.987992 106660.526170 326.589232 0.236460
  • The best R² score among the advanced models is currently 0.5448, from the Gradient Boosting model.
  • That is an improvement over Linear Regression's 0.5113, and it may improve further with hyperparameter tuning

Model Tuning¶

In [152]:
def tune_gradient_boosting():
    print("Tuning Gradient Boosting...")
    param_grid = {
        'n_estimators': [100, 200, 300, 500],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'max_depth': [3, 5, 7, 9],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'subsample': [0.8, 0.9, 1.0],
        'max_features': ['sqrt', 'log2', None]
    }
    
    gb = GradientBoostingRegressor(random_state=42)
    grid_search = RandomizedSearchCV(
        estimator=gb,
        param_distributions=param_grid,
        n_iter=20,
        cv=5,
        scoring='r2',
        n_jobs=-1,
        random_state=42,
        verbose=1
    )
    
    grid_search.fit(X_train, y_train)
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best R2 score: {grid_search.best_score_:.4f}")
    best_gb = grid_search.best_estimator_
    
    return best_gb
In [153]:
tune_gradient_boosting()
Tuning Gradient Boosting...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best parameters: {'subsample': 0.8, 'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_features': None, 'max_depth': 3, 'learning_rate': 0.1}
Best R2 score: 0.5434
Out[153]:
GradientBoostingRegressor(min_samples_leaf=4, min_samples_split=10,
                          random_state=42, subsample=0.8)
In [154]:
def tune_xgboost():
    print("Tuning XGBoost...")
    param_grid = {
        'n_estimators': [100, 200, 300, 500],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'max_depth': [3, 5, 7, 9],
        'min_child_weight': [1, 3, 5, 7],
        'gamma': [0, 0.1, 0.2, 0.3],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'reg_alpha': [0, 0.1, 1, 10],
        'reg_lambda': [0, 1, 5, 10]
    }
    
    xgb_model = xgb.XGBRegressor(random_state=42)
    grid_search = RandomizedSearchCV(
        estimator=xgb_model,
        param_distributions=param_grid,
        n_iter=20,
        cv=5,
        scoring='r2',
        n_jobs=-1,
        random_state=42,
        verbose=1
    )
    
    grid_search.fit(X_train, y_train)
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best R2 score: {grid_search.best_score_:.4f}")
    
    best_xgb = grid_search.best_estimator_
    return best_xgb
In [155]:
tune_xgboost()
Tuning XGBoost...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best parameters: {'subsample': 0.6, 'reg_lambda': 0, 'reg_alpha': 10, 'n_estimators': 500, 'min_child_weight': 3, 'max_depth': 3, 'learning_rate': 0.05, 'gamma': 0.2, 'colsample_bytree': 1.0}
Best R2 score: 0.5456
Out[155]:
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=1.0, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=0.2, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.05, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=3, max_leaves=None,
             min_child_weight=3, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=500, n_jobs=None,
             num_parallel_tree=None, random_state=42, ...)
In [159]:
def tune_lightgbm():
    print("Tuning LightGBM...")
    param_grid = {
        'n_estimators': [100, 200, 300, 500],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'max_depth': [3, 5, 7, 9, -1],
        'num_leaves': [31, 50, 100, 150],
        'min_child_samples': [5, 10, 20, 50],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'reg_alpha': [0, 0.1, 1, 10],
        'reg_lambda': [0, 1, 5, 10]
    }
    
    lgb_model = lgb.LGBMRegressor(random_state=42, verbose=-1)
    grid_search = RandomizedSearchCV(
        estimator=lgb_model,
        param_distributions=param_grid,
        n_iter=20,
        cv=5,
        scoring='r2',
        n_jobs=-1,
        random_state=42,
        verbose=1
    )
    
    grid_search.fit(X_train, y_train)
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best R2 score: {grid_search.best_score_:.4f}")
    
    best_lgb = grid_search.best_estimator_
    return best_lgb
In [160]:
tune_lightgbm()
Tuning LightGBM...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best parameters: {'subsample': 0.6, 'reg_lambda': 1, 'reg_alpha': 10, 'num_leaves': 50, 'n_estimators': 500, 'min_child_samples': 5, 'max_depth': 3, 'learning_rate': 0.05, 'colsample_bytree': 0.6}
Best R2 score: 0.5443
Out[160]:
LGBMRegressor(colsample_bytree=0.6, learning_rate=0.05, max_depth=3,
              min_child_samples=5, n_estimators=500, num_leaves=50,
              random_state=42, reg_alpha=10, reg_lambda=1, subsample=0.6,
              verbose=-1)
In [161]:
import optuna
In [163]:
# ===============================================
# Advanced Hyperparameter Tuning with Optuna
# ===============================================

def tune_with_optuna(model_type):
    print(f"Tuning {model_type} with Optuna...")
    
    def objective(trial):
        if model_type == 'xgboost':
            params = {
                'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
                'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
                'max_depth': trial.suggest_int('max_depth', 3, 10),
                'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
                'gamma': trial.suggest_float('gamma', 0, 1),
                'subsample': trial.suggest_float('subsample', 0.5, 1.0),
                'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
                'reg_alpha': trial.suggest_float('reg_alpha', 0, 10),
                'reg_lambda': trial.suggest_float('reg_lambda', 0, 10),
                'random_state': 42
            }
            model = xgb.XGBRegressor(**params)
        
        elif model_type == 'lightgbm':
            params = {
                'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
                'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
                'max_depth': trial.suggest_int('max_depth', 3, 10),
                'num_leaves': trial.suggest_int('num_leaves', 20, 200),
                'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
                'subsample': trial.suggest_float('subsample', 0.5, 1.0),
                'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
                'reg_alpha': trial.suggest_float('reg_alpha', 0, 10),
                'reg_lambda': trial.suggest_float('reg_lambda', 0, 10),
                'random_state': 42,
                'verbose': -1
            }
            model = lgb.LGBMRegressor(**params)
        
        elif model_type == 'gbr':
            params = {
                'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
                'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
                'max_depth': trial.suggest_int('max_depth', 3, 10),
                'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
                'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
                'subsample': trial.suggest_float('subsample', 0.5, 1.0),
                'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
                'random_state': 42
            }
            model = GradientBoostingRegressor(**params)
        
        else:
            raise ValueError(f"Unknown model type: {model_type}")
            
        # Use cross-validation for more robust evaluation
        kf = KFold(n_splits=5, shuffle=True, random_state=42)
        scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='r2')
        return scores.mean()
    
    # Create and optimize the study
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=50)
    
    print(f"Best trial: {study.best_trial.number}")
    print(f"Best R2 score: {study.best_value:.4f}")
    print(f"Best parameters: {study.best_params}")
    
    # Create a model with the best parameters
    if model_type == 'xgboost':
        best_model = xgb.XGBRegressor(**study.best_params)
    elif model_type == 'lightgbm':
        best_model = lgb.LGBMRegressor(**study.best_params)
    elif model_type == 'gbr':
        best_model = GradientBoostingRegressor(**study.best_params)
    
    # Train and evaluate on test set
    best_model.fit(X_train, y_train)
    #metrics = evaluate_model(best_model, X_train, X_test, y_train, y_test)
    #print(f"Test R2 score: {metrics['R2 Score']:.4f}")
    
    return best_model
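The objective's cross-validated scoring can be checked on its own. A minimal sketch of the same KFold / cross_val_score pattern on toy data (sklearn only; the toy variables are not from the notebook):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    GradientBoostingRegressor(random_state=42), X_toy, y_toy, cv=kf, scoring="r2"
)
mean_r2 = scores.mean()  # the value each Optuna trial returns and maximizes
```

Averaging R² over five shuffled folds, rather than scoring a single split, makes each trial's value less sensitive to how the data happened to be partitioned.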
In [164]:
tune_with_optuna('xgboost')
[I 2025-02-28 18:26:08,821] A new study created in memory with name: no-name-c3bea523-3901-49e5-bc06-10a192dccd4c
Tuning xgboost with Optuna...
[I 2025-02-28 18:26:24,163] Trial 0 finished with value: 0.48917752504348755 and parameters: {'n_estimators': 888, 'learning_rate': 0.13638048024350496, 'max_depth': 8, 'min_child_weight': 8, 'gamma': 0.06434676023791375, 'subsample': 0.7165790736450253, 'colsample_bytree': 0.5121990183798227, 'reg_alpha': 5.746612857926817, 'reg_lambda': 6.923761570557943}. Best is trial 0 with value: 0.48917752504348755.
[I 2025-02-28 18:26:30,704] Trial 1 finished with value: 0.48238654136657716 and parameters: {'n_estimators': 393, 'learning_rate': 0.09384983663772839, 'max_depth': 10, 'min_child_weight': 4, 'gamma': 0.5514728804308722, 'subsample': 0.5981765793087479, 'colsample_bytree': 0.8075428855306195, 'reg_alpha': 9.307261576083217, 'reg_lambda': 7.241394804967402}. Best is trial 0 with value: 0.48917752504348755.
[I 2025-02-28 18:26:34,803] Trial 2 finished with value: 0.5223005294799805 and parameters: {'n_estimators': 407, 'learning_rate': 0.11871603858664527, 'max_depth': 6, 'min_child_weight': 2, 'gamma': 0.6967152026076723, 'subsample': 0.8846920736377173, 'colsample_bytree': 0.5133437318094024, 'reg_alpha': 0.8649785036720237, 'reg_lambda': 5.878119849951487}. Best is trial 2 with value: 0.5223005294799805.
[I 2025-02-28 18:26:40,994] Trial 3 finished with value: 0.50477694272995 and parameters: {'n_estimators': 665, 'learning_rate': 0.15214036925300578, 'max_depth': 7, 'min_child_weight': 8, 'gamma': 0.9031770410225999, 'subsample': 0.681241311018806, 'colsample_bytree': 0.5141291135344452, 'reg_alpha': 4.33395472050038, 'reg_lambda': 9.411524376239278}. Best is trial 2 with value: 0.5223005294799805.
[I 2025-02-28 18:26:45,903] Trial 4 finished with value: 0.5029198408126831 and parameters: {'n_estimators': 700, 'learning_rate': 0.2995516278226636, 'max_depth': 4, 'min_child_weight': 3, 'gamma': 0.1306726528413399, 'subsample': 0.9542975672262177, 'colsample_bytree': 0.747427421816002, 'reg_alpha': 2.987130049785515, 'reg_lambda': 3.2037729939708024}. Best is trial 2 with value: 0.5223005294799805.
[I 2025-02-28 18:26:52,834] Trial 5 finished with value: 0.42719292640686035 and parameters: {'n_estimators': 674, 'learning_rate': 0.29003915104330136, 'max_depth': 6, 'min_child_weight': 3, 'gamma': 0.9274640608616926, 'subsample': 0.988081540402282, 'colsample_bytree': 0.8941856867048612, 'reg_alpha': 0.3263042730315746, 'reg_lambda': 1.4852147107431324}. Best is trial 2 with value: 0.5223005294799805.
[I 2025-02-28 18:27:01,629] Trial 6 finished with value: 0.4879873275756836 and parameters: {'n_estimators': 952, 'learning_rate': 0.15389715502445497, 'max_depth': 6, 'min_child_weight': 5, 'gamma': 0.9090277514328283, 'subsample': 0.897013687505599, 'colsample_bytree': 0.5050074177242521, 'reg_alpha': 4.353580911570742, 'reg_lambda': 1.4369264223904599}. Best is trial 2 with value: 0.5223005294799805.
[I 2025-02-28 18:27:04,869] Trial 7 finished with value: 0.5319819688796997 and parameters: {'n_estimators': 421, 'learning_rate': 0.2022139181760185, 'max_depth': 3, 'min_child_weight': 2, 'gamma': 0.9147586121490527, 'subsample': 0.6467542388459349, 'colsample_bytree': 0.9331485843647656, 'reg_alpha': 5.870927504162713, 'reg_lambda': 7.474268183728644}. Best is trial 7 with value: 0.5319819688796997.
[I 2025-02-28 18:27:06,070] Trial 8 finished with value: 0.543352198600769 and parameters: {'n_estimators': 168, 'learning_rate': 0.07709481416442678, 'max_depth': 3, 'min_child_weight': 6, 'gamma': 0.6179366698460165, 'subsample': 0.6278972757798043, 'colsample_bytree': 0.8190965895343418, 'reg_alpha': 5.518448204079417, 'reg_lambda': 2.9877500989676453}. Best is trial 8 with value: 0.543352198600769.
[I 2025-02-28 18:27:17,093] Trial 9 finished with value: 0.44873361587524413 and parameters: {'n_estimators': 866, 'learning_rate': 0.1516948752160496, 'max_depth': 9, 'min_child_weight': 9, 'gamma': 0.6178197494002838, 'subsample': 0.9407339655954512, 'colsample_bytree': 0.6605986109530679, 'reg_alpha': 6.500892125923215, 'reg_lambda': 0.03437041373199001}. Best is trial 8 with value: 0.543352198600769.
[I 2025-02-28 18:27:18,599] Trial 10 finished with value: 0.5331526756286621 and parameters: {'n_estimators': 137, 'learning_rate': 0.02033192710934223, 'max_depth': 4, 'min_child_weight': 6, 'gamma': 0.292620892412038, 'subsample': 0.5216325280362273, 'colsample_bytree': 0.9840773811906716, 'reg_alpha': 8.155834349462769, 'reg_lambda': 3.9360639171120804}. Best is trial 8 with value: 0.543352198600769.
[I 2025-02-28 18:27:19,931] Trial 11 finished with value: 0.45643426179885865 and parameters: {'n_estimators': 106, 'learning_rate': 0.010488983845342616, 'max_depth': 4, 'min_child_weight': 6, 'gamma': 0.3049572471854668, 'subsample': 0.520442469968559, 'colsample_bytree': 0.9999609534850513, 'reg_alpha': 8.468519779713077, 'reg_lambda': 3.9708718272349968}. Best is trial 8 with value: 0.543352198600769.
[I 2025-02-28 18:27:21,183] Trial 12 finished with value: 0.5151201605796814 and parameters: {'n_estimators': 100, 'learning_rate': 0.025399498046396247, 'max_depth': 3, 'min_child_weight': 6, 'gamma': 0.36416676033415973, 'subsample': 0.5411003789643622, 'colsample_bytree': 0.8339552529517178, 'reg_alpha': 7.700643593211, 'reg_lambda': 4.036221719845212}. Best is trial 8 with value: 0.543352198600769.
[I 2025-02-28 18:27:23,985] Trial 13 finished with value: 0.5437254309654236 and parameters: {'n_estimators': 250, 'learning_rate': 0.06693226182648743, 'max_depth': 4, 'min_child_weight': 7, 'gamma': 0.3696765199367511, 'subsample': 0.794711808518893, 'colsample_bytree': 0.6865005344815053, 'reg_alpha': 9.910407108713995, 'reg_lambda': 2.2326642799541316}. Best is trial 13 with value: 0.5437254309654236.
[I 2025-02-28 18:27:26,539] Trial 14 finished with value: 0.5405412554740906 and parameters: {'n_estimators': 263, 'learning_rate': 0.06381385702672845, 'max_depth': 5, 'min_child_weight': 8, 'gamma': 0.4287584297754668, 'subsample': 0.8042622571793558, 'colsample_bytree': 0.6715707976463381, 'reg_alpha': 2.5083568190792556, 'reg_lambda': 2.2601519358002795}. Best is trial 13 with value: 0.5437254309654236.
[I 2025-02-28 18:27:28,581] Trial 15 finished with value: 0.5440799832344055 and parameters: {'n_estimators': 244, 'learning_rate': 0.07041223547960868, 'max_depth': 3, 'min_child_weight': 10, 'gamma': 0.7197881334465857, 'subsample': 0.7779178290732698, 'colsample_bytree': 0.679125411798041, 'reg_alpha': 9.569034426519458, 'reg_lambda': 0.12279704257417778}. Best is trial 15 with value: 0.5440799832344055.
[I 2025-02-28 18:27:31,539] Trial 16 finished with value: 0.540407121181488 and parameters: {'n_estimators': 275, 'learning_rate': 0.06063294345361731, 'max_depth': 5, 'min_child_weight': 10, 'gamma': 0.7365524645454472, 'subsample': 0.8110906118418999, 'colsample_bytree': 0.623141542992063, 'reg_alpha': 9.781172279316053, 'reg_lambda': 0.43143826902046456}. Best is trial 15 with value: 0.5440799832344055.
[I 2025-02-28 18:27:37,498] Trial 17 finished with value: 0.5101608753204345 and parameters: {'n_estimators': 537, 'learning_rate': 0.19241134263140267, 'max_depth': 5, 'min_child_weight': 10, 'gamma': 0.7467361833627015, 'subsample': 0.7717331640243381, 'colsample_bytree': 0.7264319782750525, 'reg_alpha': 7.115478280534627, 'reg_lambda': 1.1523235971868167}. Best is trial 15 with value: 0.5440799832344055.
[I 2025-02-28 18:27:40,731] Trial 18 finished with value: 0.5394483804702759 and parameters: {'n_estimators': 285, 'learning_rate': 0.10229676823863273, 'max_depth': 4, 'min_child_weight': 7, 'gamma': 0.18883282239766835, 'subsample': 0.8371227008001572, 'colsample_bytree': 0.5736934015173419, 'reg_alpha': 9.06265115610064, 'reg_lambda': 2.3752129472463364}. Best is trial 15 with value: 0.5440799832344055.
[I 2025-02-28 18:27:44,909] Trial 19 finished with value: 0.5307446241378784 and parameters: {'n_estimators': 500, 'learning_rate': 0.20147488495809995, 'max_depth': 3, 'min_child_weight': 9, 'gamma': 0.474104213089799, 'subsample': 0.7372191544654088, 'colsample_bytree': 0.6913039115734858, 'reg_alpha': 9.67396253041139, 'reg_lambda': 5.127762143253605}. Best is trial 15 with value: 0.5440799832344055.
[I 2025-02-28 18:27:51,513] Trial 20 finished with value: 0.5363608479499817 and parameters: {'n_estimators': 208, 'learning_rate': 0.04951744310034004, 'max_depth': 7, 'min_child_weight': 9, 'gamma': 0.8011995402540919, 'subsample': 0.8614635610779272, 'colsample_bytree': 0.6017936502418365, 'reg_alpha': 7.17655323116587, 'reg_lambda': 0.9345585130656484}. Best is trial 15 with value: 0.5440799832344055.
[I 2025-02-28 18:27:54,395] Trial 21 finished with value: 0.5442478656768799 and parameters: {'n_estimators': 206, 'learning_rate': 0.08027934373962738, 'max_depth': 3, 'min_child_weight': 5, 'gamma': 0.5832246430248952, 'subsample': 0.6088060632152655, 'colsample_bytree': 0.7688938558463965, 'reg_alpha': 8.550184417955817, 'reg_lambda': 2.4928865849773194}. Best is trial 21 with value: 0.5442478656768799.
[I 2025-02-28 18:27:57,272] Trial 22 finished with value: 0.5443528175354004 and parameters: {'n_estimators': 345, 'learning_rate': 0.045342520753788654, 'max_depth': 3, 'min_child_weight': 4, 'gamma': 0.5383740706117337, 'subsample': 0.7662572032520517, 'colsample_bytree': 0.7860995489943112, 'reg_alpha': 8.46310279421163, 'reg_lambda': 2.1457065169311202}. Best is trial 22 with value: 0.5443528175354004.
[I 2025-02-28 18:28:02,431] Trial 23 finished with value: 0.5443742871284485 and parameters: {'n_estimators': 330, 'learning_rate': 0.03571713088640459, 'max_depth': 3, 'min_child_weight': 4, 'gamma': 0.5732685763841034, 'subsample': 0.6996412869755224, 'colsample_bytree': 0.7790660629800674, 'reg_alpha': 8.540430721935605, 'reg_lambda': 0.44990861881143535}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:05,876] Trial 24 finished with value: 0.5413497686386108 and parameters: {'n_estimators': 333, 'learning_rate': 0.03769392614153432, 'max_depth': 5, 'min_child_weight': 4, 'gamma': 0.5418371091725079, 'subsample': 0.6964528442045947, 'colsample_bytree': 0.7945786585066853, 'reg_alpha': 8.383487409926348, 'reg_lambda': 1.801227606365993}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:08,732] Trial 25 finished with value: 0.5407013773918152 and parameters: {'n_estimators': 360, 'learning_rate': 0.09383578193395664, 'max_depth': 3, 'min_child_weight': 4, 'gamma': 0.6247167867730725, 'subsample': 0.5759352383219941, 'colsample_bytree': 0.8604555455328667, 'reg_alpha': 7.284304617819455, 'reg_lambda': 3.08121700847077}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:27,239] Trial 26 finished with value: 0.5425023078918457 and parameters: {'n_estimators': 495, 'learning_rate': 0.03787779577950362, 'max_depth': 4, 'min_child_weight': 1, 'gamma': 0.491130039215511, 'subsample': 0.6594479465952411, 'colsample_bytree': 0.7750836297619202, 'reg_alpha': 8.680157456203698, 'reg_lambda': 5.495900414210266}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:30,868] Trial 27 finished with value: 0.53933265209198 and parameters: {'n_estimators': 335, 'learning_rate': 0.11736887581400667, 'max_depth': 3, 'min_child_weight': 5, 'gamma': 0.8187516904595493, 'subsample': 0.6099484443041393, 'colsample_bytree': 0.7510271924612915, 'reg_alpha': 6.49646428578942, 'reg_lambda': 0.7222028147330706}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:37,908] Trial 28 finished with value: 0.5147324204444885 and parameters: {'n_estimators': 195, 'learning_rate': 0.2434738648611366, 'max_depth': 5, 'min_child_weight': 3, 'gamma': 0.5672785585569524, 'subsample': 0.7457721548104954, 'colsample_bytree': 0.8683547328377962, 'reg_alpha': 7.924664906694303, 'reg_lambda': 4.253379164204221}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:44,959] Trial 29 finished with value: 0.5225183486938476 and parameters: {'n_estimators': 456, 'learning_rate': 0.04218092096635758, 'max_depth': 8, 'min_child_weight': 5, 'gamma': 0.4394513927468202, 'subsample': 0.7209905487524098, 'colsample_bytree': 0.7212782184436121, 'reg_alpha': 6.447263727827733, 'reg_lambda': 6.134843023455369}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:48,194] Trial 30 finished with value: 0.5387687802314758 and parameters: {'n_estimators': 335, 'learning_rate': 0.08359348295281316, 'max_depth': 4, 'min_child_weight': 4, 'gamma': 0.6561493506066338, 'subsample': 0.5666550463672569, 'colsample_bytree': 0.7727076033000955, 'reg_alpha': 8.884417845650987, 'reg_lambda': 2.787253390415141}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:50,377] Trial 31 finished with value: 0.542467987537384 and parameters: {'n_estimators': 213, 'learning_rate': 0.12546995609031303, 'max_depth': 3, 'min_child_weight': 5, 'gamma': 0.6939132860420116, 'subsample': 0.7669770160461884, 'colsample_bytree': 0.7183955636632947, 'reg_alpha': 9.257910968739045, 'reg_lambda': 0.20156358306157268}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:55,319] Trial 32 finished with value: 0.5410897016525269 and parameters: {'n_estimators': 616, 'learning_rate': 0.05792957787817061, 'max_depth': 3, 'min_child_weight': 7, 'gamma': 0.5528689548586072, 'subsample': 0.6900205440050493, 'colsample_bytree': 0.6456551256638154, 'reg_alpha': 7.692909838687805, 'reg_lambda': 0.6841492551333954}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:29:02,763] Trial 33 finished with value: 0.45999300479888916 and parameters: {'n_estimators': 308, 'learning_rate': 0.10644845581984247, 'max_depth': 10, 'min_child_weight': 3, 'gamma': 0.8378298041242138, 'subsample': 0.7133136650658649, 'colsample_bytree': 0.7869546343704892, 'reg_alpha': 9.23133982282121, 'reg_lambda': 1.7464679446487787}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:29:05,964] Trial 34 finished with value: 0.5420541644096375 and parameters: {'n_estimators': 375, 'learning_rate': 0.07945091367909182, 'max_depth': 3, 'min_child_weight': 2, 'gamma': 0.025209625162848415, 'subsample': 0.7800664113430529, 'colsample_bytree': 0.8468592096780337, 'reg_alpha': 8.91771014584181, 'reg_lambda': 0.05031391603297647}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:29:09,280] Trial 35 finished with value: 0.5453838586807251 and parameters: {'n_estimators': 432, 'learning_rate': 0.025260200065027993, 'max_depth': 4, 'min_child_weight': 4, 'gamma': 0.7006081302005231, 'subsample': 0.6528657324270758, 'colsample_bytree': 0.7493578971046945, 'reg_alpha': 9.886845445607618, 'reg_lambda': 1.1707680158620883}. Best is trial 35 with value: 0.5453838586807251.
[I 2025-02-28 18:29:12,614] Trial 36 finished with value: 0.5451103687286377 and parameters: {'n_estimators': 443, 'learning_rate': 0.02721077650020993, 'max_depth': 4, 'min_child_weight': 4, 'gamma': 0.5700620659878068, 'subsample': 0.6460465525980261, 'colsample_bytree': 0.7552232657580044, 'reg_alpha': 8.443426859500827, 'reg_lambda': 8.766885625636503}. Best is trial 35 with value: 0.5453838586807251.
[I 2025-02-28 18:29:16,873] Trial 37 finished with value: 0.545316469669342 and parameters: {'n_estimators': 445, 'learning_rate': 0.014051132538417561, 'max_depth': 4, 'min_child_weight': 4, 'gamma': 0.9977188285355159, 'subsample': 0.665005073625462, 'colsample_bytree': 0.8194604822600398, 'reg_alpha': 3.3097653447085658, 'reg_lambda': 9.318801375913448}. Best is trial 35 with value: 0.5453838586807251.
[I 2025-02-28 18:29:24,307] Trial 38 finished with value: 0.5432230710983277 and parameters: {'n_estimators': 616, 'learning_rate': 0.012728732176007318, 'max_depth': 6, 'min_child_weight': 3, 'gamma': 0.78526687548439, 'subsample': 0.6599673073106239, 'colsample_bytree': 0.9013964364712058, 'reg_alpha': 2.6688855103571254, 'reg_lambda': 9.683807813252347}. Best is trial 35 with value: 0.5453838586807251.
[I 2025-02-28 18:29:29,578] Trial 39 finished with value: 0.5448433876037597 and parameters: {'n_estimators': 424, 'learning_rate': 0.03178572220544605, 'max_depth': 4, 'min_child_weight': 4, 'gamma': 0.975911918061199, 'subsample': 0.6436148191741508, 'colsample_bytree': 0.8176649964963084, 'reg_alpha': 3.743006884808041, 'reg_lambda': 8.864779494401295}. Best is trial 35 with value: 0.5453838586807251.
[I 2025-02-28 18:29:46,659] Trial 40 finished with value: 0.5420290708541871 and parameters: {'n_estimators': 438, 'learning_rate': 0.030064471131231757, 'max_depth': 5, 'min_child_weight': 1, 'gamma': 0.9915905148295727, 'subsample': 0.6442874248620128, 'colsample_bytree': 0.8226221366239665, 'reg_alpha': 3.4700168401844587, 'reg_lambda': 8.777796041462576}. Best is trial 35 with value: 0.5453838586807251.
[I 2025-02-28 18:29:51,130] Trial 41 finished with value: 0.5458519577980041 and parameters: {'n_estimators': 403, 'learning_rate': 0.025181939608234893, 'max_depth': 4, 'min_child_weight': 2, 'gamma': 0.9450617049891935, 'subsample': 0.6775014498470824, 'colsample_bytree': 0.7494749285293014, 'reg_alpha': 1.605804505030021, 'reg_lambda': 8.650350883129157}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:29:55,292] Trial 42 finished with value: 0.545633852481842 and parameters: {'n_estimators': 569, 'learning_rate': 0.023855296802388265, 'max_depth': 4, 'min_child_weight': 2, 'gamma': 0.9890646565353011, 'subsample': 0.6669536484875446, 'colsample_bytree': 0.7424051756638951, 'reg_alpha': 1.5819971923894427, 'reg_lambda': 8.381041733267152}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:30:01,203] Trial 43 finished with value: 0.5420405983924865 and parameters: {'n_estimators': 618, 'learning_rate': 0.01723386549535197, 'max_depth': 6, 'min_child_weight': 2, 'gamma': 0.8546734957885499, 'subsample': 0.668317819630224, 'colsample_bytree': 0.7437699757132172, 'reg_alpha': 1.329928302833131, 'reg_lambda': 7.7227282297985225}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:30:05,770] Trial 44 finished with value: 0.5408416152000427 and parameters: {'n_estimators': 534, 'learning_rate': 0.05016731777932866, 'max_depth': 4, 'min_child_weight': 2, 'gamma': 0.8835635709906183, 'subsample': 0.6280066678165979, 'colsample_bytree': 0.7088640020832621, 'reg_alpha': 1.5477380341377909, 'reg_lambda': 8.21794504715965}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:30:12,207] Trial 45 finished with value: 0.5454214453697205 and parameters: {'n_estimators': 763, 'learning_rate': 0.011788495314032659, 'max_depth': 5, 'min_child_weight': 3, 'gamma': 0.9478992875376774, 'subsample': 0.5795941144840977, 'colsample_bytree': 0.7409612230172122, 'reg_alpha': 0.482345262428165, 'reg_lambda': 6.716675509475262}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:30:18,688] Trial 46 finished with value: 0.5449263453483582 and parameters: {'n_estimators': 798, 'learning_rate': 0.012899785159010545, 'max_depth': 5, 'min_child_weight': 1, 'gamma': 0.9487509035940388, 'subsample': 0.5764018864860738, 'colsample_bytree': 0.7392979778285507, 'reg_alpha': 0.05638133810238344, 'reg_lambda': 6.6210598087435315}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:30:25,159] Trial 47 finished with value: 0.5420300960540771 and parameters: {'n_estimators': 732, 'learning_rate': 0.02075742254358072, 'max_depth': 5, 'min_child_weight': 3, 'gamma': 0.9425923194929283, 'subsample': 0.5504265237884219, 'colsample_bytree': 0.7056346692644623, 'reg_alpha': 1.9886329876252316, 'reg_lambda': 7.099955651679609}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:30:35,435] Trial 48 finished with value: 0.43148418664932253 and parameters: {'n_estimators': 949, 'learning_rate': 0.2744329489887136, 'max_depth': 7, 'min_child_weight': 2, 'gamma': 0.8837822072415009, 'subsample': 0.502146947802395, 'colsample_bytree': 0.8086645056135917, 'reg_alpha': 0.6606127328061231, 'reg_lambda': 9.96597973111648}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:30:41,309] Trial 49 finished with value: 0.5215312957763671 and parameters: {'n_estimators': 802, 'learning_rate': 0.16891002755333676, 'max_depth': 4, 'min_child_weight': 3, 'gamma': 0.980728074563575, 'subsample': 0.5898947936728967, 'colsample_bytree': 0.6542891725895289, 'reg_alpha': 1.1067781204775877, 'reg_lambda': 7.9935035748966525}. Best is trial 41 with value: 0.5458519577980041.
Best trial: 41
Best R2 score: 0.5459
Best parameters: {'n_estimators': 403, 'learning_rate': 0.025181939608234893, 'max_depth': 4, 'min_child_weight': 2, 'gamma': 0.9450617049891935, 'subsample': 0.6775014498470824, 'colsample_bytree': 0.7494749285293014, 'reg_alpha': 1.605804505030021, 'reg_lambda': 8.650350883129157}
Out[164]:
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.7494749285293014, device=None,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, feature_types=None, gamma=0.9450617049891935,
             grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.025181939608234893,
             max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=4, max_leaves=None,
             min_child_weight=2, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=403, n_jobs=None,
             num_parallel_tree=None, random_state=None, ...)
In [165]:
tune_with_optuna('lightgbm')
[I 2025-02-28 18:33:22,410] A new study created in memory with name: no-name-907c1fb3-8399-4bed-bbc6-3b54147e7028
Tuning lightgbm with Optuna...
[I 2025-02-28 18:33:27,720] Trial 0 finished with value: 0.5326662266336206 and parameters: {'n_estimators': 629, 'learning_rate': 0.19440969104676345, 'max_depth': 6, 'num_leaves': 75, 'min_child_samples': 85, 'subsample': 0.6848108451598178, 'colsample_bytree': 0.5040101283257208, 'reg_alpha': 1.494238109285655, 'reg_lambda': 2.8567483167100804}. Best is trial 0 with value: 0.5326662266336206.
[I 2025-02-28 18:33:28,178] Trial 1 finished with value: 0.5410214784890865 and parameters: {'n_estimators': 150, 'learning_rate': 0.14099869668804374, 'max_depth': 4, 'num_leaves': 79, 'min_child_samples': 36, 'subsample': 0.8130008038302748, 'colsample_bytree': 0.7693373361439555, 'reg_alpha': 6.305577356850068, 'reg_lambda': 4.232341972652249}. Best is trial 1 with value: 0.5410214784890865.
[I 2025-02-28 18:33:31,569] Trial 2 finished with value: 0.5377573182053613 and parameters: {'n_estimators': 779, 'learning_rate': 0.07472664147668032, 'max_depth': 4, 'num_leaves': 155, 'min_child_samples': 42, 'subsample': 0.9709402817685133, 'colsample_bytree': 0.7252879942027699, 'reg_alpha': 8.54104157618054, 'reg_lambda': 6.190736802834839}. Best is trial 1 with value: 0.5410214784890865.
[I 2025-02-28 18:33:35,420] Trial 3 finished with value: 0.5303618511898678 and parameters: {'n_estimators': 659, 'learning_rate': 0.11377300036644683, 'max_depth': 10, 'num_leaves': 90, 'min_child_samples': 98, 'subsample': 0.67708965639584, 'colsample_bytree': 0.6419376081350582, 'reg_alpha': 1.5040045119179457, 'reg_lambda': 6.837348145437019}. Best is trial 1 with value: 0.5410214784890865.
[I 2025-02-28 18:33:36,115] Trial 4 finished with value: 0.5378743214204842 and parameters: {'n_estimators': 309, 'learning_rate': 0.17421406214155305, 'max_depth': 3, 'num_leaves': 113, 'min_child_samples': 34, 'subsample': 0.9738099891070344, 'colsample_bytree': 0.7562528414871743, 'reg_alpha': 5.133529252218217, 'reg_lambda': 2.11155294172321}. Best is trial 1 with value: 0.5410214784890865.
[I 2025-02-28 18:33:39,233] Trial 5 finished with value: 0.5215350745032434 and parameters: {'n_estimators': 341, 'learning_rate': 0.18767238417574686, 'max_depth': 10, 'num_leaves': 136, 'min_child_samples': 66, 'subsample': 0.8867151507232669, 'colsample_bytree': 0.9092772966637461, 'reg_alpha': 6.138544864189051, 'reg_lambda': 6.242063310000051}. Best is trial 1 with value: 0.5410214784890865.
[I 2025-02-28 18:33:42,612] Trial 6 finished with value: 0.5417250030856776 and parameters: {'n_estimators': 502, 'learning_rate': 0.029940907936526262, 'max_depth': 5, 'num_leaves': 112, 'min_child_samples': 37, 'subsample': 0.8965774632044667, 'colsample_bytree': 0.706188547974648, 'reg_alpha': 9.858552429397744, 'reg_lambda': 6.876734019667184}. Best is trial 6 with value: 0.5417250030856776.
[I 2025-02-28 18:33:47,924] Trial 7 finished with value: 0.5215763778434364 and parameters: {'n_estimators': 796, 'learning_rate': 0.12560450508861962, 'max_depth': 6, 'num_leaves': 42, 'min_child_samples': 21, 'subsample': 0.6066766343884689, 'colsample_bytree': 0.6659956073752201, 'reg_alpha': 5.889095924211835, 'reg_lambda': 5.0554002307174954}. Best is trial 6 with value: 0.5417250030856776.
[I 2025-02-28 18:33:52,614] Trial 8 finished with value: 0.5092863005557293 and parameters: {'n_estimators': 977, 'learning_rate': 0.27120086458914267, 'max_depth': 8, 'num_leaves': 29, 'min_child_samples': 79, 'subsample': 0.9243835981489068, 'colsample_bytree': 0.7532166091964823, 'reg_alpha': 8.9896729330966, 'reg_lambda': 5.630390304010492}. Best is trial 6 with value: 0.5417250030856776.
[I 2025-02-28 18:33:56,688] Trial 9 finished with value: 0.5025441274070831 and parameters: {'n_estimators': 728, 'learning_rate': 0.2917512797636476, 'max_depth': 9, 'num_leaves': 150, 'min_child_samples': 28, 'subsample': 0.6336649862655541, 'colsample_bytree': 0.5572172130464979, 'reg_alpha': 5.527068948583429, 'reg_lambda': 7.285684249055668}. Best is trial 6 with value: 0.5417250030856776.
[I 2025-02-28 18:33:59,809] Trial 10 finished with value: 0.5397143085926579 and parameters: {'n_estimators': 394, 'learning_rate': 0.023293553559949526, 'max_depth': 7, 'num_leaves': 195, 'min_child_samples': 9, 'subsample': 0.7886885096146093, 'colsample_bytree': 0.8865561891282174, 'reg_alpha': 9.37212168472918, 'reg_lambda': 9.180762799566521}. Best is trial 6 with value: 0.5417250030856776.
[I 2025-02-28 18:34:00,543] Trial 11 finished with value: 0.5440497774562261 and parameters: {'n_estimators': 169, 'learning_rate': 0.0575100237763993, 'max_depth': 4, 'num_leaves': 65, 'min_child_samples': 53, 'subsample': 0.806500640439573, 'colsample_bytree': 0.8368286546246911, 'reg_alpha': 7.5362380941522025, 'reg_lambda': 3.7573596446273814}. Best is trial 11 with value: 0.5440497774562261.
[I 2025-02-28 18:34:01,227] Trial 12 finished with value: 0.5246318104414452 and parameters: {'n_estimators': 104, 'learning_rate': 0.020000160544995246, 'max_depth': 5, 'num_leaves': 46, 'min_child_samples': 52, 'subsample': 0.514173455947583, 'colsample_bytree': 0.9982699830405768, 'reg_alpha': 7.371816306408743, 'reg_lambda': 0.29518524208796215}. Best is trial 11 with value: 0.5440497774562261.
[I 2025-02-28 18:34:02,712] Trial 13 finished with value: 0.5411925234210939 and parameters: {'n_estimators': 463, 'learning_rate': 0.0764391252126574, 'max_depth': 3, 'num_leaves': 105, 'min_child_samples': 58, 'subsample': 0.8589779586148867, 'colsample_bytree': 0.8593960774380958, 'reg_alpha': 3.349327734980789, 'reg_lambda': 8.825646631511722}. Best is trial 11 with value: 0.5440497774562261.
[I 2025-02-28 18:34:03,766] Trial 14 finished with value: 0.5404643214473739 and parameters: {'n_estimators': 232, 'learning_rate': 0.07265748179368517, 'max_depth': 5, 'num_leaves': 65, 'min_child_samples': 52, 'subsample': 0.7541433203124146, 'colsample_bytree': 0.8225882980665425, 'reg_alpha': 7.808907338965442, 'reg_lambda': 3.893214122646761}. Best is trial 11 with value: 0.5440497774562261.
[I 2025-02-28 18:34:06,271] Trial 15 finished with value: 0.5414486987170964 and parameters: {'n_estimators': 500, 'learning_rate': 0.03921614899461118, 'max_depth': 5, 'num_leaves': 119, 'min_child_samples': 68, 'subsample': 0.8487149098935276, 'colsample_bytree': 0.6525639278829782, 'reg_alpha': 9.854851648190847, 'reg_lambda': 8.237653827910606}. Best is trial 11 with value: 0.5440497774562261.
[I 2025-02-28 18:34:07,070] Trial 16 finished with value: 0.5443086205520119 and parameters: {'n_estimators': 260, 'learning_rate': 0.05354527857411969, 'max_depth': 4, 'num_leaves': 181, 'min_child_samples': 14, 'subsample': 0.9034527304964098, 'colsample_bytree': 0.9994159970452017, 'reg_alpha': 7.570311497851004, 'reg_lambda': 1.3405041995581706}. Best is trial 16 with value: 0.5443086205520119.
[I 2025-02-28 18:34:07,759] Trial 17 finished with value: 0.5295824853867158 and parameters: {'n_estimators': 228, 'learning_rate': 0.22438058810292544, 'max_depth': 4, 'num_leaves': 170, 'min_child_samples': 12, 'subsample': 0.7304901298156259, 'colsample_bytree': 0.9983176321159564, 'reg_alpha': 3.5530664186589047, 'reg_lambda': 0.3217559236443166}. Best is trial 16 with value: 0.5443086205520119.
[I 2025-02-28 18:34:08,327] Trial 18 finished with value: 0.542659887557329 and parameters: {'n_estimators': 227, 'learning_rate': 0.1074330533270406, 'max_depth': 3, 'num_leaves': 200, 'min_child_samples': 5, 'subsample': 0.8127746284476145, 'colsample_bytree': 0.9166221827780916, 'reg_alpha': 7.493120614100051, 'reg_lambda': 1.6881099685242342}. Best is trial 16 with value: 0.5443086205520119.
[I 2025-02-28 18:34:10,478] Trial 19 finished with value: 0.5314217033148203 and parameters: {'n_estimators': 380, 'learning_rate': 0.08711656962220418, 'max_depth': 7, 'num_leaves': 171, 'min_child_samples': 46, 'subsample': 0.9970050654981224, 'colsample_bytree': 0.9489513574602209, 'reg_alpha': 3.9791562588878326, 'reg_lambda': 3.3760319121733873}. Best is trial 16 with value: 0.5443086205520119.
[I 2025-02-28 18:34:11,014] Trial 20 finished with value: 0.5450057029343796 and parameters: {'n_estimators': 160, 'learning_rate': 0.06003691234250976, 'max_depth': 4, 'num_leaves': 58, 'min_child_samples': 22, 'subsample': 0.9435407081247855, 'colsample_bytree': 0.817209088334033, 'reg_alpha': 0.008716293076002302, 'reg_lambda': 1.6787101201614192}. Best is trial 20 with value: 0.5450057029343796.
[I 2025-02-28 18:34:13,453] Trial 21 finished with value: 0.5445812148281128 and parameters: {'n_estimators': 178, 'learning_rate': 0.051386159440118845, 'max_depth': 4, 'num_leaves': 57, 'min_child_samples': 23, 'subsample': 0.9217250689320882, 'colsample_bytree': 0.820165779592411, 'reg_alpha': 0.12838244355625866, 'reg_lambda': 1.7191303037263683}. Best is trial 20 with value: 0.5450057029343796.
[I 2025-02-28 18:34:19,634] Trial 22 finished with value: 0.5425661132752796 and parameters: {'n_estimators': 290, 'learning_rate': 0.04511697178277499, 'max_depth': 3, 'num_leaves': 21, 'min_child_samples': 19, 'subsample': 0.9267869847901363, 'colsample_bytree': 0.7994204834613143, 'reg_alpha': 0.038883105269208976, 'reg_lambda': 1.3753824302594748}. Best is trial 20 with value: 0.5450057029343796.
[I 2025-02-28 18:34:20,065] Trial 23 finished with value: 0.5456215903300075 and parameters: {'n_estimators': 105, 'learning_rate': 0.08836732558896855, 'max_depth': 4, 'num_leaves': 45, 'min_child_samples': 21, 'subsample': 0.9333867867274527, 'colsample_bytree': 0.9562564045513104, 'reg_alpha': 0.4431230557718911, 'reg_lambda': 1.0592541089500838}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:20,890] Trial 24 finished with value: 0.5405766869803859 and parameters: {'n_estimators': 112, 'learning_rate': 0.08746086685679198, 'max_depth': 6, 'num_leaves': 46, 'min_child_samples': 27, 'subsample': 0.9520788371967963, 'colsample_bytree': 0.9440650557990548, 'reg_alpha': 0.046171908853950505, 'reg_lambda': 2.244366121149166}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:21,918] Trial 25 finished with value: 0.535510589956807 and parameters: {'n_estimators': 199, 'learning_rate': 0.14901860687126928, 'max_depth': 5, 'num_leaves': 61, 'min_child_samples': 25, 'subsample': 0.8674362826129792, 'colsample_bytree': 0.8644110156075968, 'reg_alpha': 1.3753040510819154, 'reg_lambda': 0.13728504436942557}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:23,554] Trial 26 finished with value: 0.49995194757989825 and parameters: {'n_estimators': 160, 'learning_rate': 0.011295858507771685, 'max_depth': 4, 'num_leaves': 33, 'min_child_samples': 19, 'subsample': 0.9299751072228464, 'colsample_bytree': 0.7995893745529109, 'reg_alpha': 0.8271453383923701, 'reg_lambda': 0.9031907867979468}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:24,618] Trial 27 finished with value: 0.54136394751594 and parameters: {'n_estimators': 323, 'learning_rate': 0.09626595448073069, 'max_depth': 3, 'num_leaves': 88, 'min_child_samples': 30, 'subsample': 0.9938516134243033, 'colsample_bytree': 0.7976554971033438, 'reg_alpha': 2.405818067427746, 'reg_lambda': 2.5967127499683675}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:26,927] Trial 28 finished with value: 0.5201181333296389 and parameters: {'n_estimators': 427, 'learning_rate': 0.13194627344054186, 'max_depth': 6, 'num_leaves': 54, 'min_child_samples': 19, 'subsample': 0.9519086905723467, 'colsample_bytree': 0.9601844464005312, 'reg_alpha': 2.4385442493846776, 'reg_lambda': 3.045706229465728}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:31,801] Trial 29 finished with value: 0.4915880380404 and parameters: {'n_estimators': 600, 'learning_rate': 0.2163074058331411, 'max_depth': 6, 'num_leaves': 77, 'min_child_samples': 5, 'subsample': 0.8404272741760088, 'colsample_bytree': 0.5791174303702202, 'reg_alpha': 0.763131541151066, 'reg_lambda': 0.9138907938127938}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:32,783] Trial 30 finished with value: 0.5410774429353168 and parameters: {'n_estimators': 103, 'learning_rate': 0.05767722540963362, 'max_depth': 7, 'num_leaves': 96, 'min_child_samples': 42, 'subsample': 0.7550155301064561, 'colsample_bytree': 0.8841437378831947, 'reg_alpha': 1.823241040317661, 'reg_lambda': 4.8248189525894345}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:33,878] Trial 31 finished with value: 0.5447656530021804 and parameters: {'n_estimators': 254, 'learning_rate': 0.05808400877680589, 'max_depth': 4, 'num_leaves': 127, 'min_child_samples': 13, 'subsample': 0.8942519817661881, 'colsample_bytree': 0.9252226926952034, 'reg_alpha': 0.7700580765795652, 'reg_lambda': 1.3743962462567594}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:37,584] Trial 32 finished with value: 0.5429212425552031 and parameters: {'n_estimators': 174, 'learning_rate': 0.11050905579021411, 'max_depth': 4, 'num_leaves': 124, 'min_child_samples': 11, 'subsample': 0.88453908363, 'colsample_bytree': 0.911772275522064, 'reg_alpha': 0.5370022210437813, 'reg_lambda': 1.8440603852294066}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:39,240] Trial 33 finished with value: 0.5436706983035731 and parameters: {'n_estimators': 269, 'learning_rate': 0.06645477750923878, 'max_depth': 4, 'num_leaves': 72, 'min_child_samples': 23, 'subsample': 0.9529348589427186, 'colsample_bytree': 0.8425531247283318, 'reg_alpha': 2.1041811447376135, 'reg_lambda': 2.6945853442668843}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:39,925] Trial 34 finished with value: 0.5438783037552988 and parameters: {'n_estimators': 153, 'learning_rate': 0.037608832076791034, 'max_depth': 5, 'num_leaves': 134, 'min_child_samples': 15, 'subsample': 0.9169017549885521, 'colsample_bytree': 0.7105740579595161, 'reg_alpha': 1.1707672136006777, 'reg_lambda': 0.9705638843869945}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:40,428] Trial 35 finished with value: 0.5427154958498155 and parameters: {'n_estimators': 205, 'learning_rate': 0.08515035737680028, 'max_depth': 3, 'num_leaves': 99, 'min_child_samples': 35, 'subsample': 0.976577307146639, 'colsample_bytree': 0.9573780085655689, 'reg_alpha': 0.3574671209336951, 'reg_lambda': 2.097426660996745}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:41,574] Trial 36 finished with value: 0.5387107169068805 and parameters: {'n_estimators': 371, 'learning_rate': 0.10110259261973088, 'max_depth': 4, 'num_leaves': 88, 'min_child_samples': 30, 'subsample': 0.8374301871968041, 'colsample_bytree': 0.7818658554621556, 'reg_alpha': 1.0959483138214856, 'reg_lambda': 0.5995854012783639}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:43,202] Trial 37 finished with value: 0.5396968012491913 and parameters: {'n_estimators': 138, 'learning_rate': 0.12088174637805628, 'max_depth': 5, 'num_leaves': 56, 'min_child_samples': 41, 'subsample': 0.8796684297084698, 'colsample_bytree': 0.8863794724274013, 'reg_alpha': 2.9878992702263107, 'reg_lambda': 1.4572644846649254}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:43,869] Trial 38 finished with value: 0.5393594910745998 and parameters: {'n_estimators': 299, 'learning_rate': 0.17243087009432634, 'max_depth': 3, 'num_leaves': 39, 'min_child_samples': 16, 'subsample': 0.9594673375870353, 'colsample_bytree': 0.7432995022921125, 'reg_alpha': 4.1951285950019335, 'reg_lambda': 4.526524021636865}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:44,343] Trial 39 finished with value: 0.541503334093693 and parameters: {'n_estimators': 195, 'learning_rate': 0.04820279391746983, 'max_depth': 3, 'num_leaves': 23, 'min_child_samples': 34, 'subsample': 0.9089925422230793, 'colsample_bytree': 0.9157543526042334, 'reg_alpha': 1.6204116800140465, 'reg_lambda': 9.999165404425318}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:47,204] Trial 40 finished with value: 0.5402542217710835 and parameters: {'n_estimators': 891, 'learning_rate': 0.06386722099224824, 'max_depth': 4, 'num_leaves': 152, 'min_child_samples': 86, 'subsample': 0.9414246587526459, 'colsample_bytree': 0.8296867708818856, 'reg_alpha': 0.5138705769701981, 'reg_lambda': 2.9784502461790403}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:48,729] Trial 41 finished with value: 0.5445184518016682 and parameters: {'n_estimators': 265, 'learning_rate': 0.051112577038867596, 'max_depth': 4, 'num_leaves': 184, 'min_child_samples': 13, 'subsample': 0.8960916492228412, 'colsample_bytree': 0.9964304875925206, 'reg_alpha': 4.6547945352067925, 'reg_lambda': 1.0359297664635108}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:49,912] Trial 42 finished with value: 0.5442188536569572 and parameters: {'n_estimators': 258, 'learning_rate': 0.026715282361536048, 'max_depth': 4, 'num_leaves': 80, 'min_child_samples': 9, 'subsample': 0.8958234486894006, 'colsample_bytree': 0.9748527037126143, 'reg_alpha': 6.4933232426704475, 'reg_lambda': 0.037582154011755575}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:52,167] Trial 43 finished with value: 0.5427754113678465 and parameters: {'n_estimators': 340, 'learning_rate': 0.034722111409205815, 'max_depth': 5, 'num_leaves': 143, 'min_child_samples': 23, 'subsample': 0.9765221722257463, 'colsample_bytree': 0.9287061540163946, 'reg_alpha': 0.06984775689116524, 'reg_lambda': 2.399014779413541}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:53,066] Trial 44 finished with value: 0.49648814294248966 and parameters: {'n_estimators': 145, 'learning_rate': 0.01039799077016123, 'max_depth': 4, 'num_leaves': 51, 'min_child_samples': 17, 'subsample': 0.8667329383623265, 'colsample_bytree': 0.972684707238554, 'reg_alpha': 4.667484362671332, 'reg_lambda': 1.1404983522664032}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:54,907] Trial 45 finished with value: 0.5310781320211311 and parameters: {'n_estimators': 249, 'learning_rate': 0.07799566079346314, 'max_depth': 9, 'num_leaves': 35, 'min_child_samples': 7, 'subsample': 0.776795657756453, 'colsample_bytree': 0.8664037375368475, 'reg_alpha': 0.9866982748253421, 'reg_lambda': 1.9054159115405902}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:55,763] Trial 46 finished with value: 0.5444301348331371 and parameters: {'n_estimators': 191, 'learning_rate': 0.049908023287768236, 'max_depth': 4, 'num_leaves': 70, 'min_child_samples': 31, 'subsample': 0.7064742151056554, 'colsample_bytree': 0.9325350699218569, 'reg_alpha': 2.8681911029061515, 'reg_lambda': 0.6917947083432978}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:56,244] Trial 47 finished with value: 0.5389823110294107 and parameters: {'n_estimators': 100, 'learning_rate': 0.06892935443455488, 'max_depth': 3, 'num_leaves': 182, 'min_child_samples': 24, 'subsample': 0.8287928806960181, 'colsample_bytree': 0.6905499586878077, 'reg_alpha': 1.5501157097420872, 'reg_lambda': 1.6076542318068407}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:35:03,267] Trial 48 finished with value: 0.5292613035876131 and parameters: {'n_estimators': 659, 'learning_rate': 0.09668139637933358, 'max_depth': 5, 'num_leaves': 129, 'min_child_samples': 38, 'subsample': 0.6361764555366813, 'colsample_bytree': 0.9699246977527767, 'reg_alpha': 1.9624990748351678, 'reg_lambda': 3.4702833737538317}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:35:05,061] Trial 49 finished with value: 0.525258471414587 and parameters: {'n_estimators': 298, 'learning_rate': 0.1327156660852742, 'max_depth': 6, 'num_leaves': 158, 'min_child_samples': 12, 'subsample': 0.9313406143953251, 'colsample_bytree': 0.8903897279513846, 'reg_alpha': 6.816620129550619, 'reg_lambda': 5.469460667657804}. Best is trial 23 with value: 0.5456215903300075.
Best trial: 23
Best R2 score: 0.5456
Best parameters: {'n_estimators': 105, 'learning_rate': 0.08836732558896855, 'max_depth': 4, 'num_leaves': 45, 'min_child_samples': 21, 'subsample': 0.9333867867274527, 'colsample_bytree': 0.9562564045513104, 'reg_alpha': 0.4431230557718911, 'reg_lambda': 1.0592541089500838}
Out[165]:
LGBMRegressor(colsample_bytree=0.9562564045513104,
              learning_rate=0.08836732558896855, max_depth=4,
              min_child_samples=21, n_estimators=105, num_leaves=45,
              reg_alpha=0.4431230557718911, reg_lambda=1.0592541089500838,
              subsample=0.9333867867274527)
In [166]:
tune_with_optuna('gbr')
[I 2025-02-28 18:37:07,435] A new study created in memory with name: no-name-bff5797b-c4c7-420e-8423-0eb272ac4979
Tuning gbr with Optuna...
[I 2025-02-28 18:37:17,518] Trial 0 finished with value: 0.5166258331501118 and parameters: {'n_estimators': 371, 'learning_rate': 0.23149802485829724, 'max_depth': 5, 'min_samples_split': 8, 'min_samples_leaf': 6, 'subsample': 0.8654801770757944, 'max_features': 'sqrt'}. Best is trial 0 with value: 0.5166258331501118.
[I 2025-02-28 18:37:31,525] Trial 1 finished with value: 0.5011830303200927 and parameters: {'n_estimators': 597, 'learning_rate': 0.16297507958502855, 'max_depth': 6, 'min_samples_split': 16, 'min_samples_leaf': 5, 'subsample': 0.9385872347320672, 'max_features': 'sqrt'}. Best is trial 0 with value: 0.5166258331501118.
[I 2025-02-28 18:37:49,758] Trial 2 finished with value: 0.46754387432227185 and parameters: {'n_estimators': 581, 'learning_rate': 0.13266622970545963, 'max_depth': 8, 'min_samples_split': 15, 'min_samples_leaf': 3, 'subsample': 0.8769286659977769, 'max_features': 'log2'}. Best is trial 0 with value: 0.5166258331501118.
[I 2025-02-28 18:37:59,836] Trial 3 finished with value: 0.48447491534964604 and parameters: {'n_estimators': 258, 'learning_rate': 0.21758901231806224, 'max_depth': 9, 'min_samples_split': 17, 'min_samples_leaf': 6, 'subsample': 0.6567767542664944, 'max_features': 'log2'}. Best is trial 0 with value: 0.5166258331501118.
[I 2025-02-28 18:38:33,678] Trial 4 finished with value: 0.48657479842045426 and parameters: {'n_estimators': 694, 'learning_rate': 0.14420086005981206, 'max_depth': 7, 'min_samples_split': 11, 'min_samples_leaf': 6, 'subsample': 0.6442223861292229, 'max_features': 'log2'}. Best is trial 0 with value: 0.5166258331501118.
[I 2025-02-28 18:38:51,121] Trial 5 finished with value: 0.513709258108584 and parameters: {'n_estimators': 729, 'learning_rate': 0.1880798784796406, 'max_depth': 5, 'min_samples_split': 3, 'min_samples_leaf': 10, 'subsample': 0.7923205103911215, 'max_features': 'sqrt'}. Best is trial 0 with value: 0.5166258331501118.
[I 2025-02-28 18:39:05,886] Trial 6 finished with value: 0.39744537459852963 and parameters: {'n_estimators': 482, 'learning_rate': 0.2874701249273012, 'max_depth': 10, 'min_samples_split': 9, 'min_samples_leaf': 4, 'subsample': 0.8650893517864041, 'max_features': 'log2'}. Best is trial 0 with value: 0.5166258331501118.
[I 2025-02-28 18:39:09,982] Trial 7 finished with value: 0.5275963120649791 and parameters: {'n_estimators': 173, 'learning_rate': 0.15440197329999186, 'max_depth': 6, 'min_samples_split': 15, 'min_samples_leaf': 2, 'subsample': 0.9524441160270005, 'max_features': 'sqrt'}. Best is trial 7 with value: 0.5275963120649791.
[I 2025-02-28 18:39:30,982] Trial 8 finished with value: 0.4447553338956525 and parameters: {'n_estimators': 772, 'learning_rate': 0.14481270830876722, 'max_depth': 9, 'min_samples_split': 2, 'min_samples_leaf': 5, 'subsample': 0.8492243714089422, 'max_features': 'sqrt'}. Best is trial 7 with value: 0.5275963120649791.
[I 2025-02-28 18:39:50,694] Trial 9 finished with value: 0.5009965748591971 and parameters: {'n_estimators': 904, 'learning_rate': 0.09770195111574961, 'max_depth': 6, 'min_samples_split': 2, 'min_samples_leaf': 4, 'subsample': 0.845242795990188, 'max_features': 'sqrt'}. Best is trial 7 with value: 0.5275963120649791.
[I 2025-02-28 18:39:54,616] Trial 10 finished with value: 0.4856570793039988 and parameters: {'n_estimators': 100, 'learning_rate': 0.0158195981360717, 'max_depth': 3, 'min_samples_split': 20, 'min_samples_leaf': 1, 'subsample': 0.5534182536566528, 'max_features': None}. Best is trial 7 with value: 0.5275963120649791.
[I 2025-02-28 18:40:00,484] Trial 11 finished with value: 0.5322821113025176 and parameters: {'n_estimators': 249, 'learning_rate': 0.24880540941081658, 'max_depth': 4, 'min_samples_split': 7, 'min_samples_leaf': 8, 'subsample': 0.9793200911044663, 'max_features': 'sqrt'}. Best is trial 11 with value: 0.5322821113025176.
[I 2025-02-28 18:40:06,964] Trial 12 finished with value: 0.5341450618057997 and parameters: {'n_estimators': 143, 'learning_rate': 0.2841875556141446, 'max_depth': 3, 'min_samples_split': 6, 'min_samples_leaf': 9, 'subsample': 0.9978996570520571, 'max_features': None}. Best is trial 12 with value: 0.5341450618057997.
[I 2025-02-28 18:40:21,958] Trial 13 finished with value: 0.5191535779959852 and parameters: {'n_estimators': 313, 'learning_rate': 0.2993437119857678, 'max_depth': 3, 'min_samples_split': 6, 'min_samples_leaf': 9, 'subsample': 0.9944523744088988, 'max_features': None}. Best is trial 12 with value: 0.5341450618057997.
[I 2025-02-28 18:40:37,685] Trial 14 finished with value: 0.5056679911090216 and parameters: {'n_estimators': 418, 'learning_rate': 0.2471412387561913, 'max_depth': 4, 'min_samples_split': 6, 'min_samples_leaf': 8, 'subsample': 0.9982987291810923, 'max_features': None}. Best is trial 12 with value: 0.5341450618057997.
[I 2025-02-28 18:40:45,036] Trial 15 finished with value: 0.5092445560008925 and parameters: {'n_estimators': 245, 'learning_rate': 0.2639713069965241, 'max_depth': 4, 'min_samples_split': 5, 'min_samples_leaf': 8, 'subsample': 0.7141662972999565, 'max_features': None}. Best is trial 12 with value: 0.5341450618057997.
[I 2025-02-28 18:40:52,710] Trial 16 finished with value: 0.5267751128450314 and parameters: {'n_estimators': 184, 'learning_rate': 0.2084217747264252, 'max_depth': 4, 'min_samples_split': 11, 'min_samples_leaf': 8, 'subsample': 0.9252964221661318, 'max_features': None}. Best is trial 12 with value: 0.5341450618057997.
[I 2025-02-28 18:40:56,885] Trial 17 finished with value: 0.5376825710642157 and parameters: {'n_estimators': 106, 'learning_rate': 0.2671548773506419, 'max_depth': 3, 'min_samples_split': 8, 'min_samples_leaf': 10, 'subsample': 0.7880943235243648, 'max_features': None}. Best is trial 17 with value: 0.5376825710642157.
[I 2025-02-28 18:41:00,748] Trial 18 finished with value: 0.5416257967246378 and parameters: {'n_estimators': 123, 'learning_rate': 0.09423480329576486, 'max_depth': 3, 'min_samples_split': 9, 'min_samples_leaf': 10, 'subsample': 0.512923259349313, 'max_features': None}. Best is trial 18 with value: 0.5416257967246378.
[I 2025-02-28 18:41:13,212] Trial 19 finished with value: 0.5208678147341531 and parameters: {'n_estimators': 462, 'learning_rate': 0.07019810766489495, 'max_depth': 5, 'min_samples_split': 10, 'min_samples_leaf': 10, 'subsample': 0.5168346995942698, 'max_features': None}. Best is trial 18 with value: 0.5416257967246378.
[I 2025-02-28 18:41:22,515] Trial 20 finished with value: 0.5416957030088807 and parameters: {'n_estimators': 361, 'learning_rate': 0.07369596953659807, 'max_depth': 3, 'min_samples_split': 13, 'min_samples_leaf': 10, 'subsample': 0.7440250200014602, 'max_features': None}. Best is trial 20 with value: 0.5416957030088807.
[I 2025-02-28 18:41:33,415] Trial 21 finished with value: 0.5426331486901319 and parameters: {'n_estimators': 338, 'learning_rate': 0.06477196411506755, 'max_depth': 3, 'min_samples_split': 13, 'min_samples_leaf': 10, 'subsample': 0.7787886752931414, 'max_features': None}. Best is trial 21 with value: 0.5426331486901319.
[I 2025-02-28 18:41:44,917] Trial 22 finished with value: 0.5428936591876395 and parameters: {'n_estimators': 347, 'learning_rate': 0.06064214039077384, 'max_depth': 3, 'min_samples_split': 12, 'min_samples_leaf': 9, 'subsample': 0.7127842938125919, 'max_features': None}. Best is trial 22 with value: 0.5428936591876395.
[I 2025-02-28 18:42:03,260] Trial 23 finished with value: 0.543469553667053 and parameters: {'n_estimators': 353, 'learning_rate': 0.039199470874611626, 'max_depth': 4, 'min_samples_split': 14, 'min_samples_leaf': 9, 'subsample': 0.739062694687006, 'max_features': None}. Best is trial 23 with value: 0.543469553667053.
[I 2025-02-28 18:42:21,265] Trial 24 finished with value: 0.5457570109135942 and parameters: {'n_estimators': 532, 'learning_rate': 0.013201142547601463, 'max_depth': 4, 'min_samples_split': 13, 'min_samples_leaf': 7, 'subsample': 0.6777821390980873, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:42:42,385] Trial 25 finished with value: 0.5439261216932461 and parameters: {'n_estimators': 532, 'learning_rate': 0.010039986736273668, 'max_depth': 5, 'min_samples_split': 13, 'min_samples_leaf': 7, 'subsample': 0.6498643321838303, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:43:01,640] Trial 26 finished with value: 0.5436549026412915 and parameters: {'n_estimators': 527, 'learning_rate': 0.012657542037501569, 'max_depth': 5, 'min_samples_split': 18, 'min_samples_leaf': 7, 'subsample': 0.632831158101171, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:43:39,045] Trial 27 finished with value: 0.5436108167392069 and parameters: {'n_estimators': 648, 'learning_rate': 0.010479491492067202, 'max_depth': 5, 'min_samples_split': 19, 'min_samples_leaf': 7, 'subsample': 0.6263160862609665, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:44:17,342] Trial 28 finished with value: 0.5188927330135028 and parameters: {'n_estimators': 526, 'learning_rate': 0.03144435154272926, 'max_depth': 7, 'min_samples_split': 18, 'min_samples_leaf': 7, 'subsample': 0.6064182077680091, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:44:50,139] Trial 29 finished with value: 0.5318818151914331 and parameters: {'n_estimators': 528, 'learning_rate': 0.037078760979773806, 'max_depth': 5, 'min_samples_split': 17, 'min_samples_leaf': 7, 'subsample': 0.6793627926448832, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:45:17,226] Trial 30 finished with value: 0.5200594442228811 and parameters: {'n_estimators': 828, 'learning_rate': 0.10160343797110957, 'max_depth': 5, 'min_samples_split': 14, 'min_samples_leaf': 6, 'subsample': 0.5913224517175929, 'max_features': 'log2'}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:45:46,037] Trial 31 finished with value: 0.5437248279671836 and parameters: {'n_estimators': 642, 'learning_rate': 0.010718707706394888, 'max_depth': 5, 'min_samples_split': 20, 'min_samples_leaf': 7, 'subsample': 0.6192044818969257, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:46:20,067] Trial 32 finished with value: 0.5279056342257883 and parameters: {'n_estimators': 648, 'learning_rate': 0.024977632144370873, 'max_depth': 6, 'min_samples_split': 20, 'min_samples_leaf': 7, 'subsample': 0.5747001517631624, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:46:40,594] Trial 33 finished with value: 0.521898238645211 and parameters: {'n_estimators': 562, 'learning_rate': 0.053136898230836574, 'max_depth': 5, 'min_samples_split': 18, 'min_samples_leaf': 5, 'subsample': 0.6810347705111421, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:47:22,430] Trial 34 finished with value: 0.5012564856508492 and parameters: {'n_estimators': 616, 'learning_rate': 0.044852191212167415, 'max_depth': 7, 'min_samples_split': 16, 'min_samples_leaf': 6, 'subsample': 0.5526377301388273, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:47:45,550] Trial 35 finished with value: 0.5404149482117955 and parameters: {'n_estimators': 480, 'learning_rate': 0.012788846087546015, 'max_depth': 6, 'min_samples_split': 16, 'min_samples_leaf': 7, 'subsample': 0.6614972336753118, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:47:52,161] Trial 36 finished with value: 0.5356853105269188 and parameters: {'n_estimators': 421, 'learning_rate': 0.11546072593103919, 'max_depth': 4, 'min_samples_split': 19, 'min_samples_leaf': 5, 'subsample': 0.6306622817556439, 'max_features': 'log2'}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:48:27,370] Trial 37 finished with value: 0.5238012786375147 and parameters: {'n_estimators': 703, 'learning_rate': 0.027445701082624965, 'max_depth': 6, 'min_samples_split': 17, 'min_samples_leaf': 6, 'subsample': 0.7075722781366109, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:48:54,141] Trial 38 finished with value: 0.4803893172806074 and parameters: {'n_estimators': 585, 'learning_rate': 0.08241134882850128, 'max_depth': 7, 'min_samples_split': 15, 'min_samples_leaf': 7, 'subsample': 0.6158320947239292, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:49:12,312] Trial 39 finished with value: 0.5191469540618101 and parameters: {'n_estimators': 765, 'learning_rate': 0.04867681569444588, 'max_depth': 8, 'min_samples_split': 19, 'min_samples_leaf': 8, 'subsample': 0.6556495911466975, 'max_features': 'log2'}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:49:27,121] Trial 40 finished with value: 0.51445861942503 and parameters: {'n_estimators': 996, 'learning_rate': 0.17604560515527518, 'max_depth': 4, 'min_samples_split': 12, 'min_samples_leaf': 5, 'subsample': 0.6865704822754182, 'max_features': 'sqrt'}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:49:49,832] Trial 41 finished with value: 0.5428321400737083 and parameters: {'n_estimators': 656, 'learning_rate': 0.012314841065014846, 'max_depth': 5, 'min_samples_split': 19, 'min_samples_leaf': 7, 'subsample': 0.6321577435982716, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:50:16,816] Trial 42 finished with value: 0.5429624565039538 and parameters: {'n_estimators': 630, 'learning_rate': 0.012421853867381209, 'max_depth': 5, 'min_samples_split': 20, 'min_samples_leaf': 6, 'subsample': 0.5769075212947131, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:50:40,490] Trial 43 finished with value: 0.5305765917532641 and parameters: {'n_estimators': 532, 'learning_rate': 0.025803873387967642, 'max_depth': 6, 'min_samples_split': 18, 'min_samples_leaf': 7, 'subsample': 0.6127969490955832, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:51:07,869] Trial 44 finished with value: 0.5169077552483072 and parameters: {'n_estimators': 664, 'learning_rate': 0.05177198947130677, 'max_depth': 5, 'min_samples_split': 16, 'min_samples_leaf': 4, 'subsample': 0.6490376456361933, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:51:17,510] Trial 45 finished with value: 0.5335218764322195 and parameters: {'n_estimators': 439, 'learning_rate': 0.010319344258733786, 'max_depth': 4, 'min_samples_split': 19, 'min_samples_leaf': 6, 'subsample': 0.5415114067883431, 'max_features': 'sqrt'}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:51:36,072] Trial 46 finished with value: 0.5419270218451692 and parameters: {'n_estimators': 576, 'learning_rate': 0.028425282082328862, 'max_depth': 5, 'min_samples_split': 14, 'min_samples_leaf': 8, 'subsample': 0.5953717473890078, 'max_features': 'log2'}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:52:16,879] Trial 47 finished with value: 0.5161883235424825 and parameters: {'n_estimators': 720, 'learning_rate': 0.037834002926623006, 'max_depth': 6, 'min_samples_split': 17, 'min_samples_leaf': 8, 'subsample': 0.6315956912901397, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:52:42,653] Trial 48 finished with value: 0.5399822816152737 and parameters: {'n_estimators': 515, 'learning_rate': 0.021419953513921924, 'max_depth': 5, 'min_samples_split': 20, 'min_samples_leaf': 6, 'subsample': 0.6687823840080587, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:53:18,400] Trial 49 finished with value: 0.4661191543056004 and parameters: {'n_estimators': 750, 'learning_rate': 0.12102882067315053, 'max_depth': 10, 'min_samples_split': 18, 'min_samples_leaf': 7, 'subsample': 0.704352712615413, 'max_features': 'sqrt'}. Best is trial 24 with value: 0.5457570109135942.
Best trial: 24
Best R2 score: 0.5458
Best parameters: {'n_estimators': 532, 'learning_rate': 0.013201142547601463, 'max_depth': 4, 'min_samples_split': 13, 'min_samples_leaf': 7, 'subsample': 0.6777821390980873, 'max_features': None}
Out[166]:
GradientBoostingRegressor(learning_rate=0.013201142547601463, max_depth=4,
                          min_samples_leaf=7, min_samples_split=13,
                          n_estimators=532, subsample=0.6777821390980873)

Consolidated list of the tuned estimator configurations: GradientBoosting_Tuned_2, XGBoost_Tuned_2, and LightGBM_Tuned_2 from the earlier tuning round, and XGBoost_Tuned_Optima, LightGBM_Tuned_Optima, and GradientBoosting_Optima from Optuna. The full parameter values appear in the model dictionary in the next cell.

In [177]:
# Define a dictionary of the tuned regression models (non-default parameters only)
regression_models_3 = {
    "GradientBoosting_Tuned_2": GradientBoostingRegressor(
        min_samples_leaf=4, min_samples_split=10, random_state=42, subsample=0.8),
    "XGBoost_Tuned_2": xgb.XGBRegressor(
        colsample_bytree=1.0, gamma=0.2, learning_rate=0.05, max_depth=3,
        min_child_weight=3, n_estimators=500, random_state=42),
    "LightGBM_Tuned_2": lgb.LGBMRegressor(
        colsample_bytree=0.6, learning_rate=0.05, max_depth=3, min_child_samples=5,
        n_estimators=500, num_leaves=50, random_state=42, reg_alpha=10,
        reg_lambda=1, subsample=0.6, verbose=-1),
    "GradientBoosting_Optima": GradientBoostingRegressor(
        learning_rate=0.013201142547601463, max_depth=4, min_samples_leaf=7,
        min_samples_split=13, n_estimators=532, subsample=0.6777821390980873),
    "XGBoost_Tuned_Optima": xgb.XGBRegressor(
        colsample_bytree=0.7494749285293014, gamma=0.9450617049891935,
        learning_rate=0.025181939608234893, max_depth=4, min_child_weight=2,
        n_estimators=403),
    "LightGBM_Tuned_Optima": lgb.LGBMRegressor(
        colsample_bytree=0.9562564045513104, learning_rate=0.08836732558896855,
        max_depth=4, min_child_samples=21, n_estimators=105, num_leaves=45,
        reg_alpha=0.4431230557718911, reg_lambda=1.0592541089500838,
        subsample=0.9333867867274527, verbose=-1)
}
In [178]:
# Initialize an empty DataFrame to store results
results_df_3 = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
In [179]:
%%time
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models_3.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df_3 = pd.concat([results_df_3, pd.DataFrame([metrics])], ignore_index=True)
CPU times: total: 9.75 s
Wall time: 10.6 s
In [180]:
# Display the results DataFrame
# Tuned Models
results_df_3.sort_values(by="R2 Score", ascending=False)
Out[180]:
Model MAE MSE RMSE R2 Score
4 XGBoost_Tuned_Optima 190.967571 62720.926919 250.441464 0.551006
3 GradientBoosting_Optima 190.906116 62859.987800 250.718942 0.550011
0 GradientBoosting_Tuned_2 191.385545 62994.960983 250.987970 0.549045
5 LightGBM_Tuned_Optima 191.011681 63051.358720 251.100296 0.548641
1 XGBoost_Tuned_2 191.250406 63130.225737 251.257290 0.548076
2 LightGBM_Tuned_2 191.907873 63205.786577 251.407610 0.547535
In [181]:
# Display the results DataFrame
# Advanced Models
results_df_2.sort_values(by="R2 Score", ascending=False)
Out[181]:
Model MAE MSE RMSE R2 Score
2 GradientBoosting_Tuned_1 191.041670 63583.720750 252.158126 0.544830
4 LightGBM_Tuned_1 191.308692 63653.691737 252.296833 0.544329
3 XGBoost_Tuned_1 191.172298 63903.217757 252.790858 0.542543
1 RandomForest_Tuned_1 192.136590 64281.579837 253.538123 0.539834
5 NeuralNetwork(MLP) 201.391227 66855.660295 258.564615 0.521407
0 DecisionTree_Tuned_1 197.919800 70045.064728 264.660282 0.498576
In [182]:
# Display the results DataFrame
# Regression Models
results_df.sort_values(by="R2 Score", ascending=False)
Out[182]:
Model MAE MSE RMSE R2 Score
0 Linear Regression 203.946460 68259.489536 261.265171 0.511358
2 Ridge Regression 203.959557 68260.799293 261.267677 0.511349
1 Lasso Regression 205.169206 68709.866263 262.125669 0.508134
4 Random Forest 203.172925 74554.912118 273.047454 0.466292
5 K-Nearest Neighbors 204.794534 74634.435609 273.193037 0.465722
6 Support Vector Regressor 216.052970 89802.139881 299.670052 0.357143
3 Decision Tree 234.987992 106660.526170 326.589232 0.236460
  • After model tuning, another slight improvement was achieved.
  • The best R2 Score among the tuned models is now 0.5510, achieved by the XGBoost_Tuned_Optima model.
  • This score is still low, suggesting the models do not explain a large portion of the variance in the target variable.

Revisit¶

  • So far, various regression models were tested (Linear, Lasso, Ridge, Decision Tree, Random Forest, KNN).
  • Feature selection and transformation steps were applied.
  • Hyperparameter tuning was performed for several models (Random Forest, Gradient Boosting, XGBoost, LightGBM).
  • Optuna was used for advanced hyperparameter tuning.
  • Despite all this, the R² score remained low, at around 0.55.
  • In this section, a different approach will be tested, aiming for better results.
In [193]:
# Outlier handling
df7 = df4[(np.abs(df4.select_dtypes(include=np.number).apply(zscore)) < 3).all(axis=1)]  # drop rows more than 3 standard deviations from the mean
count_outliers(df7)
price: 1345 outliers (8.67%)
rooms: 236 outliers (1.52%)
bathroom: 0 outliers (0.00%)
square_meters: 899 outliers (5.80%)
square_meters_price: 897 outliers (5.78%)
Out[193]:
3377
In [194]:
df7.shape
Out[194]:
(15511, 9)
  • In this new approach, not all outliers will be removed.
  • Only observations more than 3 standard deviations from the mean will be removed.
  • The remaining data will be kept as valid modeling information, since it captures high-priced areas and luxury units.
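The `count_outliers` helper used above is defined earlier in the notebook. A hypothetical sketch consistent with its printed output, assuming the standard 1.5×IQR rule, could look like this (the `demo` DataFrame is illustrative only):

```python
import numpy as np
import pandas as pd

def count_outliers(df):
    """Print the outlier count per numeric column (1.5*IQR rule) and return the total."""
    total = 0
    for col in df.select_dtypes(include=np.number).columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        n = int(mask.sum())
        total += n
        print(f"{col}: {n} outliers ({n / len(df) * 100:.2f}%)")
    return total

demo = pd.DataFrame({'price': [100, 110, 105, 95, 5000], 'rooms': [2, 3, 2, 3, 2]})
count_outliers(demo)  # the extreme price of 5000 is flagged as the only outlier
```

Note that the counts reported above show outliers still remaining after the 3-standard-deviation cut: the z-score filter is deliberately looser than the IQR rule, which is the point of this new approach.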
In [195]:
df7.to_csv('df_WITHOUT OUTLIERS 3SD_DATA.csv', index=False)  # Save a copy of the data after the new outlier-handling approach
In [196]:
# Load data
data = df7.copy()

# Create dummy variables for categorical features with specified baseline categories
data = pd.get_dummies(data, columns=['real_state', 'neighborhood'], drop_first=False)
for feature, baseline in {'real_state': "flat", 'neighborhood': "Eixample"}.items():
    if f"{feature}_{baseline}" in data.columns:
        data.drop(columns=[f"{feature}_{baseline}"], inplace=True)
  • One-hot encoding was applied with "real_state_flat" and "neighborhood_Eixample" dropped as the baseline categories.
In [197]:
# Convert boolean columns to numeric (0 and 1)
bool_cols = data.select_dtypes(['bool']).columns
data[bool_cols] = data[bool_cols].astype(int)
    
In [198]:
univariate_numerical(data)
[Figure: univariate distribution plots of the numerical features]
In [199]:
data.to_csv('df_MODELING_DATA.csv', index=False)  # Save a copy of data ready for modeling
In [201]:
data.columns
Out[201]:
Index(['price', 'rooms', 'bathroom', 'lift', 'terrace', 'square_meters',
       'square_meters_price', 'real_state_apartment', 'real_state_attic',
       'real_state_study', 'neighborhood_Ciutat Vella', 'neighborhood_Gràcia',
       'neighborhood_Horta- Guinardo', 'neighborhood_Les Corts',
       'neighborhood_Nou Barris', 'neighborhood_Sant Andreu',
       'neighborhood_Sant Martí', 'neighborhood_Sants-Montjuïc',
       'neighborhood_Sarria-Sant Gervasi'],
      dtype='object')
In [200]:
# Drop 'square_meters_price' from features
X = data.drop(columns=['price','square_meters_price'])
y = data['price']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature engineering
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_poly)
X_test_scaled = scaler.transform(X_test_poly)

# Stacking Regressor
base_models = [
    ('ridge', Ridge(alpha=1.0)),
    ('lasso', Lasso(alpha=0.1)),
    ('svr', SVR(kernel='rbf'))
]
meta_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)
stacking_model = StackingRegressor(estimators=base_models, final_estimator=meta_model)

# Cross-validation
cv_scores = cross_val_score(stacking_model, X_train_scaled, y_train, cv=5, scoring='r2')
print("Mean R2 Score from Cross Validation:", np.mean(cv_scores))

# Train and Evaluate
stacking_model.fit(X_train_scaled, y_train)
y_pred = stacking_model.predict(X_test_scaled)
Mean R2 Score from Cross Validation: 0.5842273828818654
  • Polynomial features are derived by raising existing numerical features to a power (e.g., squared, cubed) and by creating interaction terms between different features. This extends the linear model to capture non-linear relationships in the data.
  • Adding polynomial and interaction terms can help the model learn more complex relationships between features, improving performance.
  • If housing prices are influenced by complex interactions between features like square meters, number of rooms, and location, a plain linear model might fail to capture these nuances.
  • Stacking is an ensemble learning technique that combines multiple base models to make better predictions. It works in two main stages: a) train base models independently: several regressors (e.g., Random Forest, XGBoost, LightGBM) make individual predictions; b) a meta-model learns from the base model outputs: a final estimator (often a linear model or another tree-based model) takes the base models' predictions as inputs and learns to optimize the final prediction.
  • While individual models may overfit, the stacking regressor generalizes better by learning which model performs best in different scenarios.
  • The achieved R² score of 0.58 is still low.
  • A final approach will add feature engineering enhancements, such as a log transformation for skewed data plus polynomial and interaction features; stacking will be applied again, but with different base models.
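Before the final model, the log transformation mentioned above is worth a quick sketch (hypothetical prices): `np.log1p` compresses the right tail of a skewed target, and `np.expm1` inverts it exactly, so predictions made on the log scale can be mapped back to euros.

```python
import numpy as np

prices = np.array([150_000.0, 320_000.0, 1_200_000.0])

# log1p(x) = log(1 + x): compresses the skewed right tail of the target
log_prices = np.log1p(prices)

# expm1(x) = exp(x) - 1: the exact inverse, recovering the original scale
recovered = np.expm1(log_prices)
```

One caveat: once the target is log-transformed, an R² computed on the log scale is not directly comparable to the R² scores reported earlier on the raw price scale.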

Final Modeling¶

In [203]:
%%time
target = "price"
features = [col for col in data.columns if col not in [target, "square_meter_price"]]

X = data[features]
y = data[target]

# Apply Log Transformation to Reduce Skewness
y = np.log1p(y)

# Create Polynomial & Interaction Features
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)

# Standardize Features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define Base Models
rf = RandomForestRegressor(n_estimators=300, max_depth=20, min_samples_split=5, random_state=42)
xgbr = xgb.XGBRegressor(n_estimators=300, max_depth=10, learning_rate=0.05, random_state=42)
lgbr = lgb.LGBMRegressor(n_estimators=300, max_depth=10, learning_rate=0.05, random_state=42)

# Stacking Model
stacked_model = StackingRegressor(
    estimators=[("rf", rf), ("xgb", xgbr), ("lgb", lgbr)],
    final_estimator=xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
)

# Train Model
stacked_model.fit(X_train, y_train)

# Evaluate Model
r2_score = stacked_model.score(X_test, y_test)
print(f"Improved R² Score: {r2_score:.4f}")
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008351 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 7725
[LightGBM] [Info] Number of data points in the train set: 12408, number of used features: 121
[LightGBM] [Info] Start training from score 7.050622
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007244 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 7372
[LightGBM] [Info] Number of data points in the train set: 9927, number of used features: 118
[LightGBM] [Info] Start training from score 7.051116
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf  (identical warning repeated; output truncated)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006298 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 7386
[LightGBM] [Info] Number of data points in the train set: 9927, number of used features: 119
[LightGBM] [Info] Start training from score 7.049856
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf  (identical warning repeated; output truncated)
Improved R² Score: 0.9830
CPU times: total: 9min 52s
Wall time: 8min 51s
  • Applied a log transformation to reduce skewness. This transformation helps normalize right-skewed distributions, making the data more symmetrical and better suited for linear models.

  • The dataset's 'price' distribution (as seen in the histogram) is highly skewed, and many machine learning models (like linear regression and tree-based models) perform better with normally distributed data.

  • Applying the log transformation reduces the effect of extreme values (e.g., luxury properties with abnormally high prices) and improves the model's ability to capture general trends.

  • Decision trees (Random Forest) and gradient boosting (XGBoost, LightGBM) each have unique strengths in handling structured data. Stacking allows leveraging multiple perspectives.

  • With this implementation, an improved R² Score of 0.9830 was achieved.

Evaluation Consolidated Notes¶

Regression Models

  • Models to be tested are: Linear Regression, Lasso Regression, Ridge Regression, Decision Tree, Random Forest, K-Nearest Neighbors, and Support Vector Regressor
  • Performance Metrics:
    • MAE (Mean Absolute Error): Measures the average magnitude of errors in a set of predictions, without considering their direction.
    • MSE (Mean Squared Error): Measures the average of the squares of the errors, giving more weight to larger errors.
    • RMSE (Root Mean Squared Error): The square root of MSE, providing error in the same units as the target variable.
    • R2 Score (Coefficient of Determination): Indicates how well the model's predictions approximate the real data points. A value closer to 1 indicates a better fit.
  • Random Forest metrics: Lowest MAE, lowest RMSE, and highest R².
  • Random Forest is the best performer overall, indicating strong predictive accuracy and low error.
  • Decision Tree metrics: Moderate errors with a good R².
  • Decision Tree is a strong candidate, although slightly behind Random Forest.
  • Ridge, Linear, and Lasso Regression metrics are consistent with each other, but their performance is noticeably lower than the tree-based methods. They might not be ideal for further tuning if the goal is the best predictive performance.
  • For hyperparameter tuning and further validation, Random Forest and Decision Tree stand out as the best candidates due to their superior performance metrics.
  • While the linear models (Ridge, Linear, and Lasso) can serve as strong baselines, they do not match the predictive accuracy of the tree-based models.
  • K-Nearest Neighbors and SVR appear less promising for further development on this dataset.
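The four metrics defined above can be computed directly with scikit-learn; the toy arrays below are illustrative, and RMSE is taken as the square root of MSE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0])   # toy ground-truth prices
y_pred = np.array([2.0, 5.0, 8.0])   # toy model predictions

mae = mean_absolute_error(y_true, y_pred)   # average |error|
mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                         # error in the target's own units
r2 = r2_score(y_true, y_pred)               # 1.0 means a perfect fit

print(mae, mse, round(rmse, 4), r2)
```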

Feature Engineering

  • From the feature importance plot, square_meters is the most significant variable, followed by square_meters_price.
  • Since price is directly derived from square_meters * square_meters_price, including both may not add new information and could introduce redundancy.
  • It makes no sense to ask the end user for both square_meters and square_meters_price in order to "predict" price.
  • NEW MODELS will be evaluated, with the feature square_meters_price DROPPED from the data
  • Although its VIF (1.568) is low (suggesting no strong collinearity within the dataset), the mathematical dependence between square_meters and square_meters_price suggests redundancy.
  • This means the model could overestimate the importance of one feature over another and lead to unstable coefficient estimates.
  • By keeping only square_meters, the model remains more interpretable, focusing on how space affects price rather than a derived variable.
  • Noted that the features 'rooms' and 'bathroom' present high multicollinearity and will also be dropped from modeling
  • Defined the function "preprocess_data(data, target_feature, drop_features, scale_features, test_size=0.30, random_state=1)" to iterate on the data preparation for modeling
  • Data preparation dropping the feature square_meters_price
  • Linear Regression and Ridge Regression performed the best in terms of R² Score.
  • Feature selection will be performed to reduce multicollinearity.
  • Data preparation dropping the feature 'rooms' due to high multicollinearity
  • After removing the feature 'rooms', Linear Regression and Ridge Regression still performed the best in terms of R² Score, but features with high multicollinearity remain
  • Data preparation dropping the feature 'bathroom' due to high multicollinearity
  • The feature real_state_flat remains with VIF > 5
  • Since "flat" is the most frequent category across neighborhoods, it might be highly correlated with certain neighborhood variables.
  • Instead of removing real_state_flat, it will be considered as the Baseline Category for real_state
  • Modified the preprocess_data function to control which one-hot encoding category to drop
  • Selected real_state_flat and neighborhood_Eixample as the baseline categories for one-hot encoding
  • There is no remaining multicollinearity in the data, suggesting that the distribution of rooms and bathrooms is less relevant than the property's area, type, and neighborhood
  • Linear Regression and Ridge Regression are the best models among those tested, but the R² scores suggest that the models are not explaining a significant portion of the variance in the target variable.
  • More advanced models will be included in the evaluation
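The preprocess_data helper is not reproduced in this excerpt. The following is a minimal sketch of what such a function might look like, assuming standard pandas/scikit-learn usage; the baseline_categories parameter stands in for the modification mentioned above that controls which one-hot dummy to drop:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess_data(data, target_feature, drop_features, scale_features,
                    baseline_categories=None, test_size=0.30, random_state=1):
    """Drop columns, one-hot encode with chosen baseline categories,
    scale the numeric features, and split into train/test sets."""
    df = data.drop(columns=drop_features)
    y = df.pop(target_feature)
    df = pd.get_dummies(df)  # encode all categorical columns
    # Drop the chosen baseline dummies (e.g. 'real_state_flat') so the
    # remaining dummies are interpreted relative to that category
    if baseline_categories:
        df = df.drop(columns=[c for c in baseline_categories if c in df.columns])
    X_train, X_test, y_train, y_test = train_test_split(
        df, y, test_size=test_size, random_state=random_state)
    X_train, X_test = X_train.copy(), X_test.copy()
    scaler = StandardScaler()
    X_train[scale_features] = scaler.fit_transform(X_train[scale_features])
    X_test[scale_features] = scaler.transform(X_test[scale_features])
    return X_train, X_test, y_train, y_test
```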

Advanced Regression Models

  • Models to be tested are: DecisionTree_Tuned_1, RandomForest_Tuned_1, GradientBoosting_Tuned_1, XGBoost_Tuned_1, LightGBM_Tuned_1, NeuralNetwork(MLP)
  • The best R² Score from the advanced models is currently 0.5448584, with the Gradient Boosting model.
  • Improving on the 0.5113 of Linear Regression is a good start, and the score could potentially be improved further with model tuning

Model Tuning

  • After model tuning, another slight improvement was achieved
  • The best R² Score from the tuned models is currently 0.549055 with the XGBoost_Tuned_Optima model.
  • The best R² Score achieved is still low, and suggests that the models are not explaining a significant portion of the variance in the target variable.
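The notebook's Optuna search itself is not shown in this excerpt. As a library-agnostic sketch of the same idea, scikit-learn's RandomizedSearchCV samples random hyperparameter combinations and cross-validates each one (the data and parameter grid below are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the housing features/target
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.05, 0.1],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=1),
    param_distributions,
    n_iter=5,        # sample 5 random combinations from the grid
    cv=3,            # 3-fold cross-validation per combination
    scoring="r2",
    random_state=1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```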

Revisit

  • So far, various regression models were tested (Linear, Lasso, Ridge, Decision Tree, Random Forest, KNN).
  • Feature selection and transformation steps were applied.
  • Hyperparameter tuning was performed for some models (Random Forest, Gradient Boosting, XGBoost, LightGBM).
  • Optuna was used for advanced hyperparameter tuning.
  • Despite all this, the R² score remained low
  • In this section, a different approach will be tested, aiming to reach better results.
  • In this new approach, not all outliers will be removed.
  • Only those beyond 3 standard deviations will be removed.
  • The remaining data will be considered valid modeling information, as it captures large or luxury units.
  • Feature selection done considering "real_state_flat" and "neighborhood_Eixample" as the baseline categories for one-hot encoding
  • Polynomial features are derived by raising existing numerical features to a power (e.g., squared, cubic) and creating interaction terms between different features. This extends the linear model to capture non-linear relationships in the data
  • Adding polynomial and interaction terms can help the model learn more complex relationships between features, improving performance
  • If housing prices are influenced by complex interactions between features like square meters, number of rooms, and location, then a linear model might fail to capture these nuances.
  • Stacking is an ensemble learning technique that combines multiple base models to make better predictions. It works in two main stages: a) Train base models independently: Several regressors (e.g., Random Forest, XGBoost, LightGBM) make individual predictions. b) Meta-model learns from base model outputs: A final estimator (often a linear model or another tree-based model) takes the predictions from the base models as inputs and learns to optimize the final prediction
  • While individual models may overfit, the stacking regressor generalizes better by learning which model performs best in different scenarios
  • Achieved R2 Score 0.58, still low
  • The final approach will apply feature engineering enhancements such as log transformation for skewed data and polynomial & interaction features; stacking modeling will be applied again, but testing other base models.
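The steps above (z-score outlier filtering, polynomial/interaction features, and a stacking regressor) can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the XGBoost/LightGBM base models are replaced by scikit-learn-only estimators so the sketch is self-contained, and drop_extreme_outliers is a hypothetical helper name:

```python
import numpy as np
from scipy.stats import zscore
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

def drop_extreme_outliers(df, column, threshold=3.0):
    """Keep only rows within `threshold` standard deviations of the mean."""
    return df[np.abs(zscore(df[column])) <= threshold]

# Squared terms and pairwise interactions let the model capture
# non-linear relationships between area, rooms, location, etc.
poly = PolynomialFeatures(degree=2, include_bias=False)

# Stage 1: base models predict independently.
# Stage 2: a linear meta-model learns how to combine their predictions.
stacked_model = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=1)),
        ("ridge", Ridge()),
    ],
    final_estimator=LinearRegression(),
)
```

Because the meta-model weighs each base model's output, the stack can generalize better than any single base model that overfits on its own.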

Final Modeling

  • Applied a log transformation to reduce skewness. This transformation helps normalize right-skewed distributions, making the data more symmetrical and better suited for linear models.
  • The dataset's "price" distribution (as seen in the histogram) is highly skewed, and many machine learning models, especially linear ones, perform better when the target is closer to normally distributed.
  • Applying the log transformation reduces the influence of extreme values (e.g., luxury properties with abnormally high prices) and improves the model's ability to capture general trends.
  • Decision-tree ensembles (Random Forest) and gradient boosting (XGBoost, LightGBM) each have unique strengths in handling structured data; stacking leverages these multiple perspectives.
  • With this implementation, an improved R² score of 0.9830 was achieved.
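The log-transform-then-stack recipe can be sketched end to end. This is a minimal, self-contained illustration on synthetic data: it uses `make_regression` plus an artificial skew instead of the housing dataset, and swaps XGBoost/LightGBM for a second scikit-learn estimator to stay dependency-free:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data, with an artificially
# right-skewed target playing the role of "price"
X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=42)
y = np.exp((y - y.min()) / (y.max() - y.min()) * 3) * 1e5

# Log-transform the skewed target; log1p is safe near zero
y_log = np.log1p(y)

X_train, X_test, y_train, y_test = train_test_split(X, y_log, random_state=42)

# Base models predict independently; a Ridge meta-model combines their outputs
stacked_model = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
                ("ridge", Ridge())],
    final_estimator=Ridge(),
)
stacked_model.fit(X_train, y_train)

# Back-transform predictions to the original price scale with expm1
preds = np.expm1(stacked_model.predict(X_test))
score = stacked_model.score(X_test, y_test)  # R² on the log scale
```

Note that the R² is computed on the log scale; predictions must be passed through `np.expm1` before being reported as prices.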

7. Deployment¶

Implementing the model in a production environment, making it accessible for real-world use. This might involve integrating the model with existing systems or deploying it via APIs or cloud platforms.

In [211]:
import os
import datetime
import joblib  # serializes the trained model and transformers

# Create models directory if it doesn't exist
os.makedirs("models", exist_ok=True)

# Get current date
current_date = datetime.datetime.now().strftime("%Y-%m-%d")

# Export the best model
file_name = f"models/stacked_model_at_{current_date}.pkl"
joblib.dump(stacked_model, file_name)

# Export feature transformers
scaler_name = f"models/scaler_at_{current_date}.pkl"
joblib.dump(scaler, scaler_name)

poly_name = f"models/poly_at_{current_date}.pkl"
joblib.dump(poly, poly_name)

print("Models saved successfully!")
Models saved successfully!
  • Best model achieved and related feature-transformation files were saved
  • Created a folder named "models" for saving model files
  • Files saved with filenames that include today's date for versioning purposes
  • Model deployment code will be held in a separate script using Streamlit
  • Streamlit is a Python framework for building interactive web applications for machine learning models.
  • The new file will be named "PEA_AIDS_PROJECT2_UI.py" and will load the model and build a user interface using Streamlit
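The load side of the deployment can be sketched as a joblib round trip. To keep the snippet self-contained, simple stand-in objects (a `LinearRegression` and a `MinMaxScaler`) replace the real stacked model and fitted transformers; only the dump/load pattern and the dated filenames mirror the deployment cell above:

```python
import datetime
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

current_date = datetime.datetime.now().strftime("%Y-%m-%d")

# Stand-ins for the real fitted artifacts (illustration only)
scaler = MinMaxScaler().fit([[0.0], [200.0]])
poly = PolynomialFeatures(degree=2, include_bias=False).fit([[1.0]])
model = LinearRegression().fit([[0.0], [1.0]], [0.0, 1.0])

# Save with the same dated-filename convention as the deployment cell
joblib.dump(model, f"stacked_model_at_{current_date}.pkl")
joblib.dump(scaler, f"scaler_at_{current_date}.pkl")
joblib.dump(poly, f"poly_at_{current_date}.pkl")

# In PEA_AIDS_PROJECT2_UI.py the flow would be: load artifacts,
# transform the user's input, then predict
loaded_model = joblib.load(f"stacked_model_at_{current_date}.pkl")
loaded_scaler = joblib.load(f"scaler_at_{current_date}.pkl")
x_new = loaded_scaler.transform([[100.0]])
prediction = loaded_model.predict(x_new)
```

In the Streamlit script, `x_new` would come from `st.number_input` widgets rather than a hard-coded value.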

8. Monitoring and Maintenance¶

Continuously monitoring the model's performance in production to ensure its accuracy and relevance over time. This stage may also involve retraining the model as new data becomes available.

  • Summarized code for testing the model on new data when it becomes available
  • New data should be free of missing values before modeling and must maintain the feature structure and naming
  • A separate file named "model_retrain.py" was created to rerun the same modeling pipeline when new data becomes available
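The schema and missing-value checks that "model_retrain.py" might run before refitting can be sketched as follows; `EXPECTED_FEATURES` and the sample rows are assumed for illustration, not the project's actual column list:

```python
import pandas as pd

# Assumed subset of the training columns, for illustration only
EXPECTED_FEATURES = ["rooms", "bathroom", "lift", "terrace", "square_meters"]

def validate_new_data(df: pd.DataFrame) -> pd.DataFrame:
    """Check that new data matches the training schema before retraining."""
    missing_cols = set(EXPECTED_FEATURES) - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing expected columns: {sorted(missing_cols)}")
    if df[EXPECTED_FEATURES].isna().any().any():
        raise ValueError("New data has missing values; impute or drop first")
    return df[EXPECTED_FEATURES]

# Hypothetical batch of new listings
new_data = pd.DataFrame({
    "rooms": [3, 2], "bathroom": [2, 1], "lift": [1, 0],
    "terrace": [0, 1], "square_meters": [90.0, 65.0],
})
checked = validate_new_data(new_data)
```

Only after `validate_new_data` passes would the script rerun the transformation and stacking pipeline on the combined old and new data.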

9. Communication and Reporting¶

Presenting findings and results to stakeholders in a clear and actionable manner, often through dashboards, visualizations, or reports.

  • Data quality is fundamental to get optimal results.
  • Considering the data quality issues (missing values and outliers) and limitations (only 10 neighbourhoods), it is suggested to evaluate automated data-collection methods (e.g., web scraping) for better quality and broader coverage.
  • The variable "square_meters_price" was not considered for modeling, under the assumption that no model is needed to predict "price" when that variable is known together with "square_meters".
  • To enable the model to learn the complex relationships between features on this dataset, modeling considered:
    • Feature Engineering Enhancements
    • Log Transformation for Skewed Data
    • Polynomial & Interaction Features
    • Modeling with XGBoost, LightGBM, and Stacking
  • To interact with users, a dedicated script (PEA_AIDS_PROJECT2_UI.py) was created that runs a local web app where users can input values and get predictions.
  • It is important to monitor the model's performance; if retraining is required, this can be done with a dedicated script (model_retrain.py).
  • For sharing and visualization, an HTML version of this notebook is made available
In [1]:
!jupyter nbconvert --to html PROJECT2_PYTHON.ipynb
[NbConvertApp] Converting notebook PROJECT2_PYTHON.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 19 image(s).
[NbConvertApp] Writing 11793001 bytes to PROJECT2_PYTHON.html
In [ ]: